Predictive Maintenance ML Pipeline
Provenance: Extracted from `Professional/Predictive_Maintenance_Pipeline_for_Cost_Optimization.ipynb` (44 KB Jupyter notebook, 15 cells). The notebook contains a prompt, an LLM response (solution), a human golden solution, a grading rubric, and a failure analysis report. It is structured as an AI evaluation task demonstrating ML pipeline design competence.
Predictive Maintenance ML Pipeline for Cost Optimization
1. Problem Statement
Fleet of 50 industrial machines with hourly sensor data (temperature, vibration, pressure, RPM). ~4% of observations precede failure within 24 hours. Cost asymmetry: missed failure = $50,000, false alarm = $2,000 (25:1 ratio). 8% missing data rate across all sensors.
Goal: Build a cost-sensitive predictive maintenance model with proper evaluation and business case.
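A quick arithmetic check of the cost asymmetry (not part of the original notebook): under a naive per-observation rule that ignores alarm cooldowns, raising an alarm pays off exactly when the failure probability exceeds the cost ratio, which lands at 4%:

```python
cost_miss, cost_alarm = 50_000, 2_000

# Alarm when the expected cost of missing exceeds the fixed alarm cost:
#   p * cost_miss > cost_alarm  =>  p > cost_alarm / cost_miss
breakeven = cost_alarm / cost_miss
print(breakeven)  # 0.04
```

In practice the optimal operating threshold differs from this break-even point because of alarm cooldowns and imperfect probability calibration, which is why the pipeline sweeps thresholds against a financial simulation instead.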
2. Data Generation
Synthetic fleet data generator producing 50 machines x 180 days of hourly readings:
```python
import numpy as np
import pandas as pd

def generate_fleet_data(seed=42):
    rng = np.random.RandomState(seed)
    n_machines, hours = 50, 24 * 180
    # Per machine: temperature, vibration, pressure, RPM as random walk + noise
    # Failure events via geometric inter-arrival times (p=0.003), 48-hour cooldown
    # Pre-failure signature: temp +3 to +8, vibration +0.15 to +0.4, pressure -1.5 to -4
    # 8% NaN injection across all sensors
    # Returns DataFrame with columns: timestamp, machine_id, temperature,
    # vibration, pressure, rpm, failure_within_24h
    ...
```
Key data characteristics:
- Random walk base signals with noise overlay
- Failures inject detectable anomalies in 24-hour pre-failure window
- 8% missing values injected randomly across all sensor channels
- ~4% positive class rate (failure within 24 hours)
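The mechanisms above can be sketched as a compact, runnable generator. The baseline levels, noise scales, and the per-event uniform draws for the failure signature are illustrative assumptions, and these constants will not reproduce the notebook's ~4% positive rate exactly; only the mechanisms themselves come from the summary above.

```python
import numpy as np
import pandas as pd

def generate_fleet_data(seed=42):
    rng = np.random.RandomState(seed)
    n_machines, hours = 50, 24 * 180
    # (mean, noise scale) per sensor -- illustrative constants
    sensors = {"temperature": (70.0, 0.5), "vibration": (0.5, 0.02),
               "pressure": (30.0, 0.3), "rpm": (1500.0, 10.0)}
    frames = []
    for m in range(n_machines):
        # Random-walk base signal plus Gaussian noise overlay, per sensor
        df = pd.DataFrame({s: mu + np.cumsum(rng.normal(0, sd / 10, hours))
                              + rng.normal(0, sd, hours)
                           for s, (mu, sd) in sensors.items()})
        df.insert(0, "machine_id", m)
        df.insert(0, "timestamp", pd.date_range("2024-01-01", periods=hours, freq="h"))
        # Failure events via geometric inter-arrival times (p=0.003), 48 h cooldown;
        # each failure stamps a 24-hour pre-failure window with a sensor signature
        label, t = np.zeros(hours, dtype=int), 0
        while (t := t + int(rng.geometric(0.003)) + 48) < hours:
            idx = np.arange(max(0, t - 24), t)
            label[idx] = 1
            df.loc[idx, "temperature"] += rng.uniform(3, 8)
            df.loc[idx, "vibration"] += rng.uniform(0.15, 0.4)
            df.loc[idx, "pressure"] -= rng.uniform(1.5, 4)
        df["failure_within_24h"] = label
        frames.append(df)
    out = pd.concat(frames, ignore_index=True)
    # 8% NaN injection, independently per sensor channel
    for c in sensors:
        out.loc[rng.rand(len(out)) < 0.08, c] = np.nan
    return out
```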
3. LLM Solution (Scored 0/20 — Critical Failure)
The LLM produced a working solution using HistGradientBoostingClassifier with:
- Proper chronological train/test split
- Rolling window feature engineering (12-hour baselines, spike features)
- Native NaN handling (no imputation needed)
- Cost-based threshold optimization via financial simulation
Critical flaw: Threshold was tuned directly on the test set (holdout leakage), making all reported savings unreliable and optimistic.
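Why this guarantees optimism can be shown with pure noise (all data and helper names below are synthetic illustrations, not the notebook's code). Because the leaky protocol picks the argmin of test-set cost, it can never report a worse test number than an honestly tuned threshold evaluated on the same test set:

```python
import numpy as np

rng = np.random.RandomState(0)

def cost(y, pred, cost_miss=50_000, cost_alarm=2_000):
    # Business cost: missed failures at $50K each, false alarms at $2K each
    fn = int(np.sum((y == 1) & ~pred))
    fp = int(np.sum((y == 0) & pred))
    return fn * cost_miss + fp * cost_alarm

# Pure-noise probabilities: there is nothing real to "save" here
y_val, y_test = rng.rand(2000) < 0.04, rng.rand(2000) < 0.04
p_val, p_test = rng.rand(2000), rng.rand(2000)
thresholds = np.linspace(0.01, 0.99, 99)

leaky_t = min(thresholds, key=lambda t: cost(y_test, p_test >= t))  # tuned on TEST
clean_t = min(thresholds, key=lambda t: cost(y_val, p_val >= t))    # tuned on validation

# By construction, the leaky protocol can never report a worse test-set cost
# than the honestly tuned threshold, so its "savings" are always optimistic.
assert cost(y_test, p_test >= leaky_t) <= cost(y_test, p_test >= clean_t)
```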
Financial Simulation Engine (from LLM solution)
```python
def simulate_financials(df, threshold, prob_col, label_col,
                        cost_miss=50_000, cost_alarm=2_000):
    # Computes total business cost = (FN * $50K) + (FP * $2K) at one threshold,
    # simulating operational reality with a 24-hour cooldown after each alarm.
    # An outer sweep over thresholds 0.01-0.99 finds the cost-minimizing value.
    ...
```
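One plausible implementation of the engine described above. The per-hour false-negative accounting, the default column names, and the exact cooldown bookkeeping are assumptions; the notebook's actual version may differ:

```python
import pandas as pd

def simulate_financials(df, threshold, prob_col="p_fail", label_col="failure_within_24h",
                        cost_miss=50_000, cost_alarm=2_000):
    """Business cost at one threshold; assumes rows are time-ordered per machine."""
    tp = fp = fn = 0
    for _, g in df.groupby("machine_id"):
        cooldown = 0
        for p, y in zip(g[prob_col].to_numpy(), g[label_col].to_numpy()):
            if cooldown > 0:            # alarm already raised; suppress for 24 h
                cooldown -= 1
                continue
            if p >= threshold:
                cooldown = 24
                if y:
                    tp += 1
                else:
                    fp += 1
            elif y:
                fn += 1                 # per-hour miss accounting (a simplification)
    return {"cost": fn * cost_miss + fp * cost_alarm, "tp": tp, "fp": fp, "fn": fn}

# Outer sweep to find the cost-minimizing threshold:
# best_t = min(np.linspace(0.01, 0.99, 99),
#              key=lambda t: simulate_financials(df, t)["cost"])
```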
4. Golden Solution (Correct Methodology)
Uses GradientBoostingClassifier with strict three-way split: Train / Validation / Test.
Key Differences from LLM Solution
| Aspect | LLM Solution | Golden Solution |
|--------|-------------|-----------------|
| Data split | Train / Test (2-way) | Train / Val / Test (3-way) |
| Threshold tuning | On test set (LEAKAGE) | On validation set only |
| Test evaluation | Contaminated by tuning | Clean, single evaluation |
| Final score | 0/20 (critical fail) | Full marks |
Feature Engineering Pipeline
```python
# Per-machine, backward-looking features:
# - 12-hour rolling mean/std for temperature, vibration, pressure, rpm
# - Spike features: (current - rolling_mean) / rolling_std
# - Hour-of-day and day-of-week temporal features
# - Forward-fill NaN per machine before feature computation
```
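A sketch of this feature pipeline, assuming the column layout from the data-generation section; the exact window details and feature names are illustrative, not the golden solution's code:

```python
import pandas as pd

def build_features(df, window=12):
    """Backward-looking rolling features per machine; spike = deviation in std units."""
    df = df.sort_values(["machine_id", "timestamp"]).copy()
    for col in ["temperature", "vibration", "pressure", "rpm"]:
        # Forward-fill gaps within each machine only (no cross-machine bleed)
        df[col] = df.groupby("machine_id")[col].ffill()
        roll = df.groupby("machine_id")[col].rolling(window, min_periods=1)
        mean = roll.mean().reset_index(level=0, drop=True)
        std = roll.std().reset_index(level=0, drop=True)
        df[f"{col}_roll_mean"] = mean
        df[f"{col}_roll_std"] = std
        # Spike: how far the current reading sits above its recent baseline
        df[f"{col}_spike"] = ((df[col] - mean) / (std + 1e-9)).fillna(0.0)
    df["hour"] = df["timestamp"].dt.hour
    df["dayofweek"] = df["timestamp"].dt.dayofweek
    return df
```

Note that the rolling windows only look backward in time, which is what keeps the features free of look-ahead bias.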
Correct Threshold Optimization
```python
# 1. Train model on the train set
# 2. Generate probabilities on the VALIDATION set
# 3. Sweep thresholds, computing business cost on the validation set
# 4. Select best_threshold = argmin(validation_cost)
# 5. Evaluate ONCE on the test set with the frozen threshold
# 6. Report final metrics (precision, recall, F1, AP, confusion matrix)
```
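The threshold-selection steps might look as follows. The costing here is simplified per-row (no alarm cooldown), and `evaluate` in the trailing comment is a hypothetical reporting helper:

```python
import numpy as np

def pick_threshold(y_val, p_val, cost_miss=50_000, cost_alarm=2_000):
    """Sweep thresholds on the VALIDATION set only; return (threshold, cost)."""
    best_t, best_cost = 0.5, float("inf")
    for t in np.linspace(0.01, 0.99, 99):
        pred = p_val >= t
        cost = (np.sum((y_val == 1) & ~pred) * cost_miss
                + np.sum((y_val == 0) & pred) * cost_alarm)
        if cost < best_cost:
            best_t, best_cost = float(t), int(cost)
    return best_t, best_cost

# The threshold is then frozen and the test set is touched exactly once:
#   t_star, _ = pick_threshold(y_val, p_val)
#   final_report = evaluate(y_test, p_test >= t_star)   # single, clean evaluation
```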
5. Grading Rubric (20 Points)
| Area | Points | Criteria |
|------|--------|----------|
| Leakage prevention | 5 | Chronological split; backward-looking features; temporal ordering acknowledged |
| Feature engineering | 4 | Per-machine computation; safe NaN handling; reasonable for 24-hour horizon |
| Imbalance & cost | 4 | Accounts for 4% positive rate; cost matrix reflects $50K vs $2K |
| Evaluation protocol | 5 | Threshold tuned on validation (not test); business cost computed as (FN*$50K)+(FP*$2K) |
| Completeness | 2 | End-to-end runnable; key decisions explained |
Critical Fail Conditions (Automatic Zero)
- Random train/test split on time-series data
- Threshold tuned on test set (data leakage)
- Accuracy used as primary optimization metric
- Future data used in feature calculations (look-ahead bias)
- Asymmetric business cost ignored or never computed
6. Failure Analysis
LLM Score Breakdown:
- Leakage prevention: 5/5 (proper chronological split, backward-looking features)
- Feature engineering: 4/4 (good use of groupby for machine-specific stats)
- Imbalance & cost: 4/4 (correctly modeled asymmetric costs)
- Evaluation protocol: 0/5 (threshold tuned on holdout set)
- Completeness: 2/2 (runnable code, well-explained)
Subtotal: 15/20; the critical-failure penalty brings the final score to 0/20.
Root Cause: The LLM splits the data into train and test sets only, then sweeps thresholds directly on the test data to find the "optimal" threshold, reporting those metrics as final savings. This is holdout leakage: the reported savings are biased upward and cannot be trusted.
Key Takeaway: The LLM understands cost-sensitive learning and temporal splitting but fails to implement a proper three-way evaluation protocol. This reflects a systematic weakness in LLM-generated ML pipelines: the model knows what to do but makes subtle methodological errors in how it is done.