Source: BENCHMARK.md (ingested 2026-03-28)

Predictive Maintenance ML Pipeline Benchmark

Domain: Machine Learning Engineering
Estimated human time: 4-8 hours
Predicted Gemini 3.0 Pro score: 20-40 / 100

benchmark_ml_predictive_maintenance/
|-- BENCHMARK.md              <-- THIS FILE (all 5 fields)
|-- data_generator.py         <-- Generates the synthetic dataset
|-- golden_solution.py        <-- Correct ML pipeline (run to verify)
|-- golden_results.json       <-- Exact metrics from golden solution
|-- sensor_data.csv           <-- Generated sensor data (64,800 rows)
|-- failures.csv              <-- 100 failure events
|-- machine_metadata.csv      <-- 30 machines

FIELD 1: PROMPT

Copy everything between the --- BEGIN PROMPT --- and --- END PROMPT --- markers below and paste it into Gemini 3.0 Pro.

--- BEGIN PROMPT ---

Predictive Maintenance ML Pipeline Challenge

Background

You are an ML engineer at an industrial manufacturing company. The maintenance team has provided 90 days of sensor data from 30 machines and a log of equipment failures. Your task is to build a complete, production-ready ML pipeline that predicts equipment failures 24 hours in advance.

Data

Run the following Python code to generate the dataset:

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

def generate_dataset(seed=42):
    rng = np.random.default_rng(seed)
    n_machines = 30
    n_hours = 90 * 24
    start = datetime(2025, 1, 1)

    all_failures = []
    for m in range(n_machines):
        n_f = rng.integers(2, 6)
        candidates = np.arange(72, n_hours - 24)
        rng.shuffle(candidates)
        chosen = []
        for c in candidates:
            if all(abs(int(c) - ch) >= 72 for ch in chosen):
                chosen.append(int(c))
            if len(chosen) == n_f:
                break
        for fh in sorted(chosen):
            ft = rng.choice(["bearing_wear", "overheating", "seal_degradation"])
            all_failures.append({"machine_id": m + 1, "hour": fh, "failure_type": ft})

    sensor_names = ["vibration", "temperature", "pressure", "acoustic_db", "rpm"]
    base_means = np.array([50.0, 75.0, 30.0, 62.0, 1500.0])
    base_stds = np.array([5.0, 3.0, 2.0, 4.0, 50.0])
    noise_scales = np.array([1.5, 0.8, 0.5, 2.0, 10.0])

    all_records = []
    timestamps = [start + timedelta(hours=h) for h in range(n_hours)]
    for m in range(n_machines):
        m_id = m + 1
        m_failures = [f["hour"] for f in all_failures if f["machine_id"] == m_id]
        bases = rng.normal(base_means, base_stds)
        innovations = rng.normal(0, 1, (n_hours, 5)) * noise_scales
        sensors = np.empty((n_hours, 5))
        sensors[0] = bases
        for t in range(1, n_hours):
            sensors[t] = 0.95 * sensors[t-1] + 0.05 * bases + innovations[t]
        for fh in m_failures:
            for t in range(max(0, fh - 48), fh):
                dt = fh - t
                if dt <= 48: sensors[t, 0] += (48 - dt) * 0.25
                if dt <= 24: sensors[t, 1] += (24 - dt) ** 1.2 * 0.12
                if dt <= 36: sensors[t, 2] -= (36 - dt) * 0.06
                if dt <= 12: sensors[t, 3] += (12 - dt) * 0.8
        mask = rng.random((n_hours, 5)) < 0.025
        sensors = np.where(mask, np.nan, np.round(sensors, 2))
        for t in range(n_hours):
            all_records.append({"machine_id": m_id, "timestamp": timestamps[t],
                "vibration": sensors[t,0], "temperature": sensors[t,1],
                "pressure": sensors[t,2], "acoustic_db": sensors[t,3], "rpm": sensors[t,4]})

    sensor_df = pd.DataFrame(all_records)
    failure_df = pd.DataFrame(all_failures)
    failure_df["timestamp"] = [start + timedelta(hours=int(h)) for h in failure_df["hour"]]
    failure_df = failure_df.drop(columns=["hour"])
    meta_df = pd.DataFrame({"machine_id": range(1, n_machines+1),
        "model_type": rng.choice(["TypeA","TypeB","TypeC"], size=n_machines),
        "install_year": rng.integers(2015, 2024, size=n_machines),
        "location": rng.choice(["Plant_North","Plant_South","Plant_East"], size=n_machines)})
    return sensor_df, failure_df, meta_df

sensor_df, failure_df, meta_df = generate_dataset()

This produces:

  • sensor_df: 64,800 rows (30 machines x 2,160 hours), 5 sensor columns, ~2.5% missing values
  • failure_df: ~100 failure events with timestamps and failure types
  • meta_df: Machine metadata (model type, install year, location)

Task

Build a complete ML pipeline that, for each machine at each hour, predicts whether the machine will fail within the next 24 hours. Your pipeline must be production-ready.

Requirements

  1. Data exploration -- Summarize the data, identify quality issues, handle missing values.

  2. Label construction -- Create binary labels (1 = failure within next 24 hours, 0 = no failure). Document your labeling approach.

  3. Feature engineering -- Create meaningful features from the sensor time series (rolling statistics, rate-of-change, cross-sensor features). Explain your feature design choices.

  4. Train/validation/test split -- Split the data appropriately for model development and evaluation. Justify your splitting strategy.

  5. Model training -- Train a classification model. Handle the class imbalance in the data (~96% of rows are negative). Justify your approach.

  6. Evaluation -- Report model performance using appropriate metrics. Explain why you chose those metrics.

  7. Threshold optimization -- The business context is:

    • Cost of a missed failure (false negative): $50,000 (unplanned downtime)
    • Cost of a false alarm (false positive): $2,000 (unnecessary inspection)

    Optimize your decision threshold to minimize total business cost.
  8. Complete, runnable Python code -- All code must run end-to-end and reproduce your results.

Deliverables

Provide:

  • Complete Python code for the entire pipeline
  • Model performance metrics on the test set
  • Feature importance analysis
  • Discussion of potential data leakage and how you prevented it
  • Deployment recommendations

--- END PROMPT ---

FIELD 2: CONVERSATION LINK

[Paste Gemini 3.0 Pro shareable conversation link here after testing]

After pasting the prompt, send follow-up hints F1-F5 (see Field 5 below), then generate a shareable link.

FIELD 3: GOLDEN SOLUTION

See golden_solution.py for the complete runnable code. Key results:

Correct Pipeline Summary

Step 1: Data Exploration

  • 64,800 sensor readings from 30 machines over 90 days
  • 100 failure events, 3.7% positive label rate
  • ~2.5% missing values per sensor (handled via forward-fill per machine)

Step 2: Label Construction

For each (machine_id, timestamp), label = 1 if that machine has a failure within the next 24 hours. Labels use future failure events but NOT future sensor values.

  • 2,400 positive labels (3.7%), 62,400 negative (96.3%)
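
This labeling approach can be sketched as follows (an illustrative helper, not the golden_solution.py code itself; it assumes the `sensor_df` and `failure_df` frames produced by the generator, and uses `pd.merge_asof` to find each machine's next failure):

```python
import pandas as pd

def build_labels(sensor_df, failure_df, horizon_hours=24):
    # Sort by timestamp (required by merge_asof) and reset the index so
    # the merged result aligns positionally with df.
    df = sensor_df.sort_values("timestamp").reset_index(drop=True)
    fails = (failure_df.sort_values("timestamp")
             .rename(columns={"timestamp": "failure_ts"}))
    # For each sensor row, find that machine's NEXT failure event.
    merged = pd.merge_asof(
        df, fails[["machine_id", "failure_ts"]],
        left_on="timestamp", right_on="failure_ts",
        by="machine_id", direction="forward")
    delta = merged["failure_ts"] - merged["timestamp"]
    # Label 1 only for rows strictly before a failure within the horizon.
    # Rows with no upcoming failure have delta = NaT, which compares False.
    df["label"] = ((delta > pd.Timedelta(0)) &
                   (delta <= pd.Timedelta(hours=horizon_hours))).astype(int)
    return df
```

Note that only future failure events enter the label; no future sensor values are read.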

Step 3: Feature Engineering (39 features, backward-looking only)

  • Per sensor: rolling mean, std, min, max, range over past 24h
  • Rate of change: mean of last 6h minus mean of previous 6h
  • Cross-sensor: vibration/temperature ratio, pressure change rate
  • Time: hour of day, day of week
  • Raw sensor values

Critical: All rolling windows use only past data (a pandas `.rolling()` window ends at the current row by default). Features are computed per machine via groupby.
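
A condensed sketch of the per-machine, backward-looking pattern (illustrative feature names; the full 39-feature set lives in golden_solution.py):

```python
import pandas as pd

def add_rolling_features(df, sensors=("vibration", "temperature",
                                      "pressure", "acoustic_db", "rpm")):
    df = df.sort_values(["machine_id", "timestamp"]).reset_index(drop=True)
    g = df.groupby("machine_id")
    for s in sensors:
        # Rolling windows END at the current row: only past data is used,
        # and groupby keeps each machine's history separate.
        roll = g[s].rolling(window=24, min_periods=1)
        df[f"{s}_mean_24h"] = roll.mean().reset_index(level=0, drop=True)
        df[f"{s}_max_24h"] = roll.max().reset_index(level=0, drop=True)
        # Rate of change: mean of last 6h minus mean of the 6h before that.
        m6 = (g[s].rolling(window=6, min_periods=1).mean()
              .reset_index(level=0, drop=True))
        df[f"{s}_roc_6h"] = m6 - m6.groupby(df["machine_id"]).shift(6)
    return df
```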

Step 4: Temporal Split (NOT random)

  • Train: Jan 1 - Feb 24 (~61%, 39,600 rows)
  • Validation: Feb 25 - Mar 14 (20%, 12,960 rows)
  • Test: Mar 15 - Mar 31 (~19%, 12,240 rows)

No random shuffling. No overlap. The model never sees future data during training.
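
The split reduces to a few boolean masks (a sketch; boundary timestamps taken from the dates above, assuming an hourly `timestamp` column):

```python
import pandas as pd

def temporal_split(df, train_end="2025-02-24 23:00",
                   val_end="2025-03-14 23:00"):
    # Strict temporal ordering: every train timestamp precedes every
    # validation timestamp, which precedes every test timestamp.
    ts = pd.to_datetime(df["timestamp"])
    train = df[ts <= pd.Timestamp(train_end)]
    val = df[(ts > pd.Timestamp(train_end)) & (ts <= pd.Timestamp(val_end))]
    test = df[ts > pd.Timestamp(val_end)]
    return train, val, test
```

Because the masks partition on time alone, nearby hours from the same machine can never straddle two splits.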

Step 5: Model Training

  • GradientBoostingClassifier (200 trees, depth=5)
  • Class imbalance handled via sample weights (positive class weighted 27.4x)
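
The weighting scheme looks roughly like this (a sketch assuming scikit-learn is available; the hypothetical helper weights positives by the negative:positive ratio, which is what produces the ~27x figure on the train split):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_weighted(X_train, y_train):
    y = np.asarray(y_train)
    # Weight each positive by the negative:positive ratio so the minority
    # class contributes comparably to the loss.
    pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)
    weights = np.where(y == 1, pos_weight, 1.0)
    model = GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                       random_state=42)
    model.fit(X_train, y, sample_weight=weights)
    return model
```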

Step 6: Evaluation (default threshold = 0.5)

| Metric | Validation | Test |
|--------|-----------|------|
| Precision | 0.421 | -- |
| Recall | 0.244 | -- |
| F1 | 0.309 | -- |
| PR-AUC | 0.299 | -- |
| Accuracy | 0.953 | -- |

Accuracy is MISLEADING (baseline = 0.957 by predicting all-negative).
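
The metric suite above can be computed with a small helper (a sketch; it uses sklearn's `average_precision_score` for PR-AUC, and keeps accuracy only to expose how misleading it is):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             average_precision_score, accuracy_score)

def evaluate(y_true, y_prob, threshold=0.5):
    # Threshold the predicted probabilities, then report
    # imbalance-aware metrics alongside (misleading) accuracy.
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "pr_auc": average_precision_score(y_true, y_prob),
        "accuracy": accuracy_score(y_true, y_pred),
    }
```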

Step 7: Threshold Optimization (on validation set)

  • Optimal threshold: 0.050 (minimizes business cost)
  • Business cost at optimal: $14.5M on validation set
  • Note: threshold is far below default 0.5 because missed failures ($50K) cost 25x more than false alarms ($2K)
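
The validation-set sweep can be sketched as follows (illustrative helper; the 0.01 step size is an assumption, and the test set is touched only once, with the chosen threshold):

```python
import numpy as np

def optimal_threshold(y_val, p_val, cost_fn=50_000, cost_fp=2_000):
    # Sweep candidate thresholds on the VALIDATION set and keep the one
    # minimizing total business cost = FN * $50K + FP * $2K.
    best_t, best_cost = 0.5, float("inf")
    for t in np.arange(0.01, 1.0, 0.01):
        pred = (p_val >= t).astype(int)
        fn = int(((y_val == 1) & (pred == 0)).sum())
        fp = int(((y_val == 0) & (pred == 1)).sum())
        cost = fn * cost_fn + fp * cost_fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

Because a false negative costs 25x a false positive, the sweep lands far below 0.5.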

Step 8: Final Test Results (threshold = 0.050)

| Metric | Value |
|--------|-------|
| Precision | 0.150 |
| Recall | 0.654 |
| F1 | 0.244 |
| PR-AUC | 0.305 |
| Accuracy | 0.850 (misleading) |
| Business cost | $11.2M |

Step 9: Top Features

  1. vibration (0.212)
  2. temperature (0.096)
  3. pressure (0.056)
  4. vibration_max_24h (0.055)
  5. temperature_min_24h (0.050)

Step 10: Leakage Check

  • No temporal overlap between train/val/test
  • Test F1 = 0.244 (well below the 0.70 leakage threshold)
  • Model is learning real degradation patterns, not temporal artifacts

FIELD 4: RUBRIC / UNIT TESTS

Rubric (100 Points)

| # | Criterion | Points | Primary Trap |
|---|-----------|--------|--------------|
| R01 | Data exploration and missing value handling | 5 | -- |
| R02 | Correct label construction | 8 | Look-ahead bias |
| R03 | Feature engineering quality | 10 | Future data in features |
| R04 | TEMPORAL train/test split | 20 | Random split = leakage |
| R05 | Class imbalance handling | 10 | No handling = predict all-negative |
| R06 | Appropriate evaluation metrics | 12 | Using accuracy as primary metric |
| R07 | Threshold optimization on validation set | 12 | Optimizing on test set |
| R08 | Business cost analysis | 8 | Ignoring cost asymmetry |
| R09 | Leakage awareness and prevention | 10 | No discussion of leakage |
| R10 | Code quality and reproducibility | 5 | -- |
| | TOTAL | 100 | |

R01: Data Exploration (5 pts)

  • 5: Summarizes shape, missing values, class distribution, sensor distributions
  • 3: Partial exploration
  • 0: No exploration

R02: Label Construction (8 pts)

  • 8: Correct temporal labeling (label = 1 if failure within next 24h), clearly documented
  • 5: Labels constructed correctly but not well documented
  • 2: Labels use information from current or past sensor values inappropriately
  • 0: No label construction shown

R03: Feature Engineering (10 pts)

  • 10: Rolling features computed per-machine using ONLY past data, multiple feature types, cross-sensor features
  • 7: Good features but computed on full dataset before splitting (subtle leakage)
  • 4: Basic features only
  • 0: No feature engineering

R04: Temporal Split (20 pts) -- MOST CRITICAL

  • 20: Uses strict temporal split (train period < val period < test period), no random shuffling, clearly justified
  • 12: Uses temporal split but with some overlap or incorrect boundary handling
  • 5: Mentions time-series issues but still uses random split
  • 0: Uses train_test_split(random_state=42) or similar random split

Scoring note: This is the single most discriminative criterion. Random splitting on time-series data causes massive temporal leakage where nearby timestamps from the same machine appear in both train and test. Award maximum 5 if random split is used, regardless of other quality.

DIAGNOSTIC: If the model reports test F1 > 0.70, it almost certainly used random splitting. Correct F1 should be 0.20-0.40.

R05: Class Imbalance Handling (10 pts)

  • 10: Uses class weights, oversampling (SMOTE), or cost-sensitive learning; explains the approach
  • 6: Uses some imbalance technique without explanation
  • 3: Mentions imbalance but doesn't handle it
  • 0: Ignores class imbalance

R06: Evaluation Metrics (12 pts)

  • 12: Reports precision, recall, F1, and PR-AUC; explicitly notes accuracy is misleading for imbalanced data; reports confusion matrix
  • 8: Reports F1 or precision/recall but not PR-AUC, or doesn't note accuracy limitation
  • 4: Reports accuracy as main metric but also shows other metrics
  • 0: Reports only accuracy

R07: Threshold Optimization (12 pts)

  • 12: Optimizes threshold on validation set using business costs; shows cost analysis; applies optimized threshold to test set
  • 8: Optimizes threshold but on test set (leakage)
  • 4: Mentions threshold optimization but doesn't implement it
  • 0: Uses default 0.5 threshold

R08: Business Cost Analysis (8 pts)

  • 8: Computes total business cost (FN * $50K + FP * $2K), shows cost at multiple thresholds, discusses cost-recall tradeoff
  • 5: Mentions business costs but doesn't compute them systematically
  • 0: Ignores business costs

R09: Leakage Awareness (10 pts)

  • 10: Explicitly discusses potential sources of leakage (temporal, feature, label), shows how each is prevented, validates with sanity checks
  • 6: Mentions leakage but doesn't comprehensively address all sources
  • 3: Brief mention of leakage without concrete prevention steps
  • 0: No mention of data leakage

R10: Code Quality (5 pts)

  • 5: Complete, runnable, well-organized code that reproduces all results
  • 3: Runs but has issues
  • 0: Non-runnable or incomplete

Automated Unit Tests

| ID | Test | Expected | Type |
|----|------|----------|------|
| T01 | Split is temporal, not random | Train timestamps < val < test | HARD |
| T02 | No timestamp overlap between splits | Zero overlap | HARD |
| T03 | Reports F1 or precision-recall (not just accuracy) | Present | HARD |
| T04 | Test F1 < 0.70 (no leakage) | F1 between 0.10 and 0.70 | HARD |
| T05 | Class imbalance technique used | class_weight / SMOTE / sample_weight | HARD |
| T06 | Threshold != 0.5 (optimized) | Threshold adjusted | HARD |
| T07 | Threshold optimized on val set, not test | Val set used | HARD |
| T08 | Missing values handled | fillna / imputation present | SOFT |
| T09 | Rolling features use groupby(machine_id) | Per-machine computation | HARD |
| T10 | Business cost computed | FN x $50,000 + FP x $2,000 shown | HARD |
| T11 | Accuracy noted as misleading | Discussed | SOFT |
| T12 | Feature importance reported | Top features shown | SOFT |

Diagnostic rule: If T04 fails (F1 > 0.70), the model has temporal leakage. This single test catches the most critical failure.
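
Tests T01, T02, and T04 can be automated with a checker along these lines (a sketch; `train_ts`, `val_ts`, and `test_ts` are assumed to be the timestamp collections of each split):

```python
def check_split(train_ts, val_ts, test_ts, test_f1):
    # T01: strict temporal ordering of the three splits.
    # T02: zero timestamp overlap between splits.
    # T04: test F1 inside the no-leakage band [0.10, 0.70).
    results = {}
    results["T01_temporal"] = max(train_ts) < min(val_ts) <= max(val_ts) < min(test_ts)
    results["T02_no_overlap"] = (set(train_ts).isdisjoint(val_ts)
                                 and set(val_ts).isdisjoint(test_ts)
                                 and set(train_ts).isdisjoint(test_ts))
    results["T04_no_leakage"] = 0.10 <= test_f1 < 0.70
    return results
```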

FIELD 5: FAILURE ANALYSIS

Predicted Gemini 3.0 Pro Performance: 20-40 / 100 (Weak)

Failure 1: Random Train/Test Split (20 pts at risk) -- CRITICAL

Prediction: Gemini will write X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). This is the #1 most common ML pattern in training data.

Why it fails: For time-series sensor data, random splitting places adjacent timestamps from the same machine in both train and test. The model memorizes the temporal context instead of learning degradation patterns. Test F1 jumps to 0.80+ (vs correct 0.24).

Failure mode: FM-008 (Inappropriate Generalization from Common Patterns). train_test_split is appropriate for i.i.d. data but catastrophically wrong for time-series.

Failure 2: Features Computed Before Split (10 pts at risk)

Prediction: Gemini will compute all rolling features on the full dataset, THEN split. This leaks test-set statistics into training features.

Why it fails: Rolling means/stds computed on all data include future values. The correct approach is to split first, then compute features within each split (or use a pipeline that respects temporal ordering).

Failure mode: FM-001 (Constraint Propagation). The constraint "no future data in training features" must be maintained across the entire pipeline.

Failure 3: Accuracy as Primary Metric (12 pts at risk)

Prediction: Gemini will report accuracy (0.96+) as the main success metric.

Why it fails: With 96.3% negative labels, a model that predicts "no failure" for every row achieves 96.3% accuracy. Accuracy is meaningless here. The correct metrics are precision-recall, F1, and PR-AUC.

Failure mode: FM-008 (Inappropriate Generalization). Accuracy is the default metric in most ML tutorials, but it's useless for imbalanced classification.

Failure 4: No Threshold Optimization (12 pts at risk)

Prediction: Gemini will use the default 0.5 decision threshold, or optimize on the test set.

Why it fails: The optimal threshold (0.05) is far from 0.50 because missed failures cost 25x more than false alarms. Using 0.50 gives recall of only 24% (misses 76% of failures). Optimizing on the test set is data leakage.

Failure mode: FM-002 (Assumption Tracking). The assumption that 0.5 is a good threshold must be questioned given the cost asymmetry.

Failure 5: No Leakage Discussion (10 pts at risk)

Prediction: Gemini will not explicitly discuss data leakage risks or prevention strategies.

Why it fails: Leakage prevention is a meta-cognitive step -- the model must reason about its OWN pipeline for correctness, not just produce code that runs.

Failure mode: FM-002 (Assumption Tracking). The implicit assumption "my pipeline is correct" must be verified.


Expected Score Breakdown

| Criterion | Predicted Score | Reason |
|-----------|-----------------|--------|
| R01 Data exploration | 3-5 / 5 | LLMs are decent at EDA |
| R02 Label construction | 5-8 / 8 | Usually correct but may not document well |
| R03 Feature engineering | 4-7 / 10 | Will compute on full data before split |
| R04 Temporal split | 0-5 / 20 | Will use random split |
| R05 Class imbalance | 3-6 / 10 | May use class_weight but not explain |
| R06 Evaluation metrics | 0-4 / 12 | Will report accuracy as primary |
| R07 Threshold optimization | 0-4 / 12 | Will use 0.5 or optimize on test |
| R08 Business cost | 0-3 / 8 | May mention but not compute |
| R09 Leakage awareness | 0-3 / 10 | Unlikely to discuss |
| R10 Code quality | 3-5 / 5 | Will produce runnable code |
| TOTAL | 18-50 / 100 | |


Follow-Up Hints for Recovery Testing

F1 -- Temporal Splitting (HIGH diagnostic value)

"You used train_test_split() which randomly shuffles the data. For time-series sensor data, this means timestamps from Tuesday and Thursday of the same week could end up in train and test respectively. The model can memorize temporal patterns instead of learning degradation signals. How should you split time-series data to prevent this temporal leakage?"

Expected recovery: Likely to switch to temporal split after this hint, recovering 10-15 points on R04.

F2 -- Feature Leakage (HIGH diagnostic value)

"You computed rolling statistics on the entire dataset before splitting into train and test. This means the rolling mean at time T in the training set includes values from time T+1, T+2, etc. that might be in the test set. How can you restructure the pipeline to prevent this?"

Expected recovery: May restructure but often gets confused about the correct ordering. Partial recovery 3-7 points on R03.

F3 -- Accuracy Is Misleading (MEDIUM diagnostic value)

"Your model achieves 96% accuracy. But the dataset has only 3.7% positive labels. What accuracy would a model achieve if it simply predicted 'no failure' for every single row? What metrics are more appropriate for highly imbalanced binary classification?"

Expected recovery: High -- will likely switch to F1/precision-recall. Recovery 6-10 points on R06.

F4 -- Threshold Optimization (MEDIUM diagnostic value)

"You're using a 0.5 decision threshold. Given that a missed failure costs $50,000 (25x more than a false alarm at $2,000), should the threshold be higher or lower than 0.5? How would you determine the optimal threshold, and on which data split should you optimize it?"

Expected recovery: Medium -- may optimize threshold but might do it on test set. Recovery 4-8 points on R07/R08.

F5 -- Leakage Sanity Check (LOW diagnostic value)

"Run your pipeline with the temporal split and with a random split. Compare the test F1 scores. If the random split gives F1 > 0.70 but the temporal split gives F1 < 0.40, what does this tell you about data leakage?"

Expected recovery: Provides the diagnostic framework but requires the model to execute the comparison. Recovery 3-6 points on R09.


Expected Recovery Trajectory

| Stage | Estimated Score | Recovery |
|-------|-----------------|----------|
| Initial | ~30 / 100 | -- |
| After F1 (temporal split) | ~45 | +15 |
| After F2 (feature leakage) | ~50 | +5 |
| After F3 (metrics) | ~58 | +8 |
| After F4 (threshold) | ~64 | +6 |
| After F5 (leakage check) | ~68 | +4 |
| Final | ~55-70 / 100 | Recovery rate: ~60% |

Cross-Model Predictions

| Model | Split | Metrics | Threshold | Leakage Aware | Total |
|-------|-------|---------|-----------|---------------|-------|
| Gemini 3.0 Pro | Random (0) | Accuracy (0) | Default (0) | No (0) | 20-40 |
| GPT-4o | Random (0) | Mixed (4) | Partial (4) | Partial (3) | 25-45 |
| Claude Sonnet 4 | Temporal (12) | Proper (8) | Val-set (8) | Yes (6) | 45-65 |
| DeepSeek-V3 | Random (0) | Accuracy (0) | Default (0) | No (0) | 15-35 |

Common failure across all models: Random train/test splitting. This is the single most likely failure because train_test_split() appears in virtually every ML tutorial in training data.