Provenance: Ingested from C:\Users\mesha\Downloads\BENCHMARK_COMPLETE.md (~31 KB) and C:\Users\mesha\Downloads\BENCHMARK.md (~21 KB) on 2026-03-28. BENCHMARK_COMPLETE.md contains the full IFRS 9 ECL benchmark submission package (all 5 Mercor fields inline). BENCHMARK.md contains the Predictive Maintenance ML Pipeline benchmark. Both are complete LLM evaluation benchmarks with task prompts, golden solutions, rubrics, automated unit tests, and failure analysis.

Part 1: IFRS 9 ECL Benchmark — Complete Submission Package

Directory Map

All files are in: mercor-llm-failsafe/outputs/benchmark_ifrs9_ecl/

benchmark_ifrs9_ecl/
|
|-- BENCHMARK_COMPLETE.md        <-- THIS FILE (index + all 5 Mercor fields inline)
|
|-- task_prompt.md               <-- Field 1: Copy-paste this into Gemini
|-- golden_solution_narrative.md <-- Field 3: Step-by-step correct answer
|-- golden_solution.py           <-- Runnable code that produces exact numbers
|-- golden_results.json          <-- Exact numerical outputs (auto-generated)
|-- rubric.md                    <-- Field 4: 100-point scoring rubric + 14 unit tests
|-- follow_ups.md                <-- 6 progressive hints for recovery testing
|-- failure_analysis.md          <-- Field 5: Why Gemini fails + cross-model comparison
|-- mercor_export.md             <-- Summary/overview of all 5 fields

Workflow

Open task_prompt.md, copy its full content, paste into Gemini 3.0 Pro
Save Gemini's response
Score against rubric.md (14 unit tests, 100-point scale)
Send hints from follow_ups.md (F1 through F6), record recovery
Get shareable Gemini conversation link -> paste into Field 2
Write final failure_analysis.md with actual (not just predicted) results

FIELD 1: TASK PROMPT

IFRS 9 Expected Credit Loss -- Commercial Real Estate Portfolio

Memorandum

TO: Credit Risk Analytics Team
FROM: Sarah Chen, Chief Risk Officer
RE: Q1 2026 IFRS 9 Expected Credit Loss Computation -- CRE Portfolio
DATE: March 31, 2026
CLASSIFICATION: Internal -- Regulatory Reporting

1. Background and Objectives

The bank is required to compute Expected Credit Loss (ECL) provisions under IFRS 9 for its commercial real estate (CRE) loan portfolio as of the Q1 2026 reporting date (March 31, 2026). The ECL computation must comply with IFRS 9.5.5 requirements, including forward-looking macroeconomic scenarios and staging based on significant increases in credit risk (SICR).

This portfolio consists of 20 CRE loans across three property segments (Office, Retail, Industrial). You are to compute the full ECL provision using the bank's approved methodology described in Section 5.

This analysis feeds directly into the bank's quarterly regulatory filing. Accuracy and auditability are paramount.

2. Portfolio Data

Reporting date: March 31, 2026. All balances in USD.

| Loan ID | Balance ($) | Origination Date | Maturity Date | Orig Rating | Current Rating | Segment | LTV (%) | Interest Rate (%) | Rate Type | Undrawn Commitment ($) | Collateral Value ($) | ESG Risk Score |
|---------|------------|------------------|---------------|-------------|----------------|---------|---------|-------------------|-----------|----------------------|---------------------|----------------|
| L01 | 5,000,000 | 2021-03-15 | 2031-03-15 | 2 | 2 | Office | 65.0 | 4.50 | Fixed | 0 | 7,692,308 | 72 |
| L02 | 3,200,000 | 2022-06-01 | 2029-06-01 | 3 | 4 | Retail | 72.0 | 5.20 | Fixed | 500,000 | 4,444,444 | 65 |
| L03 | 8,500,000 | 2020-01-10 | 2032-01-10 | 1 | 1 | Industrial | 55.0 | 3.80 | Fixed | 1,000,000 | 15,454,545 | 88 |
| L04 | 2,100,000 | 2023-09-01 | 2030-09-01 | 4 | 5 | Retail | 78.0 | 6.10 | Floating | 300,000 | 2,692,308 | 58 |
| L05 | 12,000,000 | 2019-04-20 | 2031-04-20 | 2 | 3 | Office | 60.0 | 4.20 | Fixed | 2,000,000 | 20,000,000 | 75 |
| L06 | 1,800,000 | 2023-01-15 | 2035-01-15 | 3 | 3 | Industrial | 50.0 | 5.00 | Fixed | 0 | 3,600,000 | 82 |
| L07 | 6,500,000 | 2021-07-01 | 2029-07-01 | 4 | 6 | Retail | 85.0 | 6.50 | Floating | 800,000 | 7,647,059 | 45 |
| L08 | 4,000,000 | 2018-11-01 | 2030-11-01 | B+ | 4 | Office | 68.0 | 4.80 | Fixed | 500,000 | 5,882,353 | 70 |
| L09 | 7,300,000 | 2022-03-01 | 2034-03-01 | 2 | 2 | Industrial | 45.0 | 4.00 | Fixed | 1,500,000 | 16,222,222 | 90 |
| L10 | 3,500,000 | 2020-08-15 | 2028-08-15 | 3 | 5 | Retail | 80.0 | 5.80 | Floating | 0 | 4,375,000 | 52 |
| L11 | 9,000,000 | 2021-01-01 | 2033-01-01 | 1 | 2 | Office | 58.0 | 4.10 | Fixed | 1,200,000 | 15,517,241 | 85 |
| L12 | 2,500,000 | 2024-01-15 | 2031-01-15 | 5 | 5 | Retail | 75.0 | 6.80 | Floating | 400,000 | 3,333,333 | 48 |
| L13 | 15,000,000 | 2020-06-01 | 2037-06-01 | 1 | 1 | Industrial | 40.0 | 3.50 | Fixed | 3,000,000 | 37,500,000 | 92 |
| L14 | 1,200,000 | 2023-06-01 | 2028-06-01 | 6 | 7 | Retail | 92.0 | 8.50 | Fixed | 0 | 1,304,348 | 35 |
| L15 | 4,800,000 | 2019-12-01 | 2031-12-01 | 3 | 4 | Office | 70.0 | 4.60 | Fixed | 600,000 | 6,857,143 | 68 |
| L16 | 6,000,000 | 2022-09-01 | 2030-09-01 | 2 | 3 | Industrial | 52.0 | 4.30 | Fixed | 800,000 | 11,538,462 | 80 |
| L17 | 2,800,000 | 2023-03-15 | 2030-03-15 | 4 | 4 | Retail | 74.0 | 5.90 | Fixed | 200,000 | 3,783,784 | 62 |
| L18 | 10,500,000 | 2020-10-01 | 2032-10-01 | 2 | 3 | Office | 62.0 | 4.40 | Fixed | 1,500,000 | 16,935,484 | 73 |
| L19 | 1,500,000 | 2024-06-01 | 2031-06-01 | 5 | 6 | Retail | 88.0 | 7.20 | Floating | 0 | 1,704,545 | 40 |
| L20 | 8,000,000 | 2021-09-01 | 2033-09-01 | 2 | 2 | Industrial | 48.0 | 4.10 | Fixed | 1,000,000 | 16,666,667 | 86 |

Total portfolio balance: $115,200,000 across 20 loans.

3. Internal Rating Transition Matrix

7-grade scale (Grade 1 = strongest, Grade 7 = weakest) plus Default (D) absorbing state. Calibrated to 2010-2025 default experience.

| From \ To | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Default |
|-----------|---------|---------|---------|---------|---------|---------|---------|---------|
| Grade 1 | 0.9200 | 0.0650 | 0.0100 | 0.0030 | 0.0010 | 0.0005 | 0.0003 | 0.0002 |
| Grade 2 | 0.0100 | 0.9050 | 0.0600 | 0.0150 | 0.0050 | 0.0025 | 0.0015 | 0.0010 |
| Grade 3 | 0.0020 | 0.0250 | 0.8800 | 0.0550 | 0.0200 | 0.0100 | 0.0050 | 0.0030 |
| Grade 4 | 0.0005 | 0.0050 | 0.0300 | 0.8550 | 0.0600 | 0.0250 | 0.0150 | 0.0095 |
| Grade 5 | 0.0000 | 0.0020 | 0.0050 | 0.0300 | 0.8320 | 0.0700 | 0.0350 | 0.0280 |
| Grade 6 | 0.0000 | 0.0005 | 0.0020 | 0.0050 | 0.0300 | 0.8000 | 0.0800 | 0.0825 |
| Grade 7 | 0.0000 | 0.0000 | 0.0005 | 0.0020 | 0.0050 | 0.0400 | 0.7530 | 0.1995 |
| Default | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |

4. Macroeconomic Scenarios

Three scenarios for probability-weighted ECL. All projections annual from 2026 through 2030. CRE Price Index rebased to 100.0 at reporting date.

Base Case (50%): Continued moderate expansion. GDP 2.0-2.3%, Unemployment 4.0-4.3%, CRE Index 100→110.

Downside (30%): Severe recession, CRE correction. GDP -3.2% to +1.5%, Unemployment 5.5-7.8%, CRE Index 100→82→86.

Upside (20%): Strong growth, tech/infrastructure expansion. GDP 2.5-3.5%, Unemployment 3.3-3.8%, CRE Index 100→123.

5. Approved ECL Methodology

Stage Classification (SICR):

Stage 1: No SICR → 12-month ECL
Stage 2: SICR triggered → lifetime ECL
Stage 3: Credit-impaired (Grade 7 or Default) → lifetime ECL

SICR triggers: the annualized lifetime PD has increased by more than 100% relative to origination (current/origination ratio above 2.0x) OR by more than 150 bps in absolute terms.
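The staging rule can be sketched as a small function. This is a minimal illustration, not the golden solution's code: the function name and argument layout are hypothetical, and the annualized PDs used in the usage lines are taken from the Stage Classification table in Field 3.

```python
def classify_stage(orig_ann_pd, curr_ann_pd, curr_grade,
                   rel_threshold=2.0, abs_threshold=0.0150):
    """Return the IFRS 9 stage (1, 2 or 3) for one loan.

    PDs are decimals (0.0249 means 2.49%). curr_grade is 1-7 or "D".
    rel_threshold=2.0 encodes "increased by more than 100%";
    abs_threshold=0.0150 encodes "more than 150 bps".
    """
    # Stage 3: credit-impaired (Grade 7 or Default).
    if curr_grade in (7, "D"):
        return 3
    relative = curr_ann_pd / orig_ann_pd   # current / origination ratio
    absolute = curr_ann_pd - orig_ann_pd   # change in percentage points / 100
    # Stage 2: either SICR trigger fires.
    if relative > rel_threshold or absolute > abs_threshold:
        return 2
    return 1


# Usage with values from the Stage Classification table:
print(classify_stage(0.0232, 0.0363, 5))  # L04: 1.56x, +1.31% -> 1
print(classify_stage(0.0249, 0.0828, 6))  # L07: 3.32x, +5.79% -> 2
print(classify_stage(0.0479, 0.0908, 6))  # L19: 1.89x, +4.29% -> 2 (absolute trigger only)
print(classify_stage(0.0937, 0.1818, 7))  # L14: Grade 7      -> 3
```

L04 is the boundary case tested by T04: neither trigger fires, so it stays in Stage 1 despite the one-notch downgrade.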

PD Term Structure: Matrix exponentiation M^t for cumulative PD. Macro overlay: PD_adjusted = PD_base * exp(beta_GDP * delta_GDP + beta_Unemp * delta_Unemp) with beta_GDP = -0.025, beta_Unemp = 0.018.
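A minimal sketch of the PD term structure, using only the standard library. Several details are assumptions not stated in the memo: rows are rescaled proportionally before exponentiation, delta_GDP and delta_Unemp are treated as percentage-point deviations, and the helper names are hypothetical.

```python
import math

# One-year transition matrix from Section 3 (rows: Grades 1-7, then Default).
M_RAW = [
    [0.9200, 0.0650, 0.0100, 0.0030, 0.0010, 0.0005, 0.0003, 0.0002],
    [0.0100, 0.9050, 0.0600, 0.0150, 0.0050, 0.0025, 0.0015, 0.0010],
    [0.0020, 0.0250, 0.8800, 0.0550, 0.0200, 0.0100, 0.0050, 0.0030],
    [0.0005, 0.0050, 0.0300, 0.8550, 0.0600, 0.0250, 0.0150, 0.0095],
    [0.0000, 0.0020, 0.0050, 0.0300, 0.8320, 0.0700, 0.0350, 0.0280],
    [0.0000, 0.0005, 0.0020, 0.0050, 0.0300, 0.8000, 0.0800, 0.0825],
    [0.0000, 0.0000, 0.0005, 0.0020, 0.0050, 0.0400, 0.7530, 0.1995],
    [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
]
# Assumption: proportional rescaling so every row sums to 1.0.
M = [[p / sum(row) for p in row] for row in M_RAW]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cumulative_pd(matrix, grade, years):
    """Cumulative PD at horizons 1..years: Default column of matrix**t."""
    power, out = matrix, []
    for _ in range(years):
        out.append(power[grade - 1][-1])  # P(in Default by end of year t)
        power = matmul(power, matrix)
    return out


def macro_adjust(pd_base, delta_gdp, delta_unemp,
                 beta_gdp=-0.025, beta_unemp=0.018):
    """PD_adjusted = PD_base * exp(beta_GDP*delta_GDP + beta_Unemp*delta_Unemp)."""
    return pd_base * math.exp(beta_gdp * delta_gdp + beta_unemp * delta_unemp)


cum_g4 = cumulative_pd(M, grade=4, years=5)
print([round(p, 4) for p in cum_g4])
```

Because Default is absorbing, the Default column of M^t is non-decreasing in t, which gives a quick sanity check on any implementation.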

LGD: Collateral-based: LGD = max(0, 1 - collateral_adjusted / EAD). Haircuts: Office 30%, Retail 35%, Industrial 25%.

EAD: EAD = Balance + CCF * Undrawn. Stage 1 CCF = 75%, Stage 2/3 CCF = 100%.
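The LGD and EAD rules above can be written out directly. The sketch below assumes that collateral_adjusted means collateral value times (1 - segment haircut); function names are illustrative only. Inputs in the usage lines come from the portfolio table, with stages from the golden solution.

```python
# Segment haircuts from the LGD rule.
HAIRCUT = {"Office": 0.30, "Retail": 0.35, "Industrial": 0.25}


def ead(balance, undrawn, stage):
    """EAD = Balance + CCF * Undrawn (CCF 75% in Stage 1, 100% in Stage 2/3)."""
    ccf = 0.75 if stage == 1 else 1.00
    return balance + ccf * undrawn


def lgd(collateral_value, segment, exposure):
    """LGD = max(0, 1 - collateral_adjusted / EAD), floored at zero."""
    collateral_adjusted = collateral_value * (1.0 - HAIRCUT[segment])
    return max(0.0, 1.0 - collateral_adjusted / exposure)


# L07: Retail, Stage 2
ead_l07 = ead(6_500_000, 800_000, stage=2)                   # 7,300,000
print(round(lgd(7_647_059, "Retail", ead_l07), 3))           # 0.319

# L02: Retail, Stage 1
ead_l02 = ead(3_200_000, 500_000, stage=1)                   # 3,575,000
print(round(lgd(4_444_444, "Retail", ead_l02), 3))           # 0.192

# L01: Office, Stage 1, fully covered after the 30% haircut
print(lgd(7_692_308, "Office", ead(5_000_000, 0, stage=1)))  # 0.0
```

Note that staging feeds into LGD through EAD: moving a loan from Stage 1 to Stage 2 raises the CCF, which raises EAD and therefore LGD for the same collateral.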

ECL: ECL(t) = marginal_PD_adjusted(t) * LGD * EAD * DF(t). Weighted: 50% Base + 30% Downside + 20% Upside.
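A hedged sketch of the ECL summation and scenario weighting. The memo excerpt does not define DF(t) or the period convention, so the code assumes annual periods, end-of-year discounting with DF(t) = (1 + EIR)^(-t), and marginal PD taken as the year-on-year difference in macro-adjusted cumulative PD. All names and the sample inputs are hypothetical.

```python
WEIGHTS = {"base": 0.50, "downside": 0.30, "upside": 0.20}


def scenario_ecl(cum_pd, lgd, ead, eir):
    """Sum over t of marginal_PD(t) * LGD * EAD * DF(t) for one scenario.

    cum_pd[t-1] is the macro-adjusted cumulative PD at the end of year t.
    Pass a one-element list for a 12-month (Stage 1) ECL.
    """
    total, prev = 0.0, 0.0
    for t, cum in enumerate(cum_pd, start=1):
        marginal = cum - prev             # PD of defaulting during year t
        df = 1.0 / (1.0 + eir) ** t       # assumed discount factor
        total += marginal * lgd * ead * df
        prev = cum
    return total


def weighted_ecl(ecl_by_scenario):
    """Probability-weighted ECL: 50% Base + 30% Downside + 20% Upside."""
    return sum(WEIGHTS[s] * ecl_by_scenario[s] for s in WEIGHTS)


# Illustrative inputs only (not golden-solution values).
ecl = {
    "base": scenario_ecl([0.020, 0.045, 0.070], lgd=0.30, ead=1_000_000, eir=0.05),
    "downside": scenario_ecl([0.030, 0.065, 0.100], lgd=0.30, ead=1_000_000, eir=0.05),
    "upside": scenario_ecl([0.015, 0.035, 0.055], lgd=0.30, ead=1_000_000, eir=0.05),
}
print(round(weighted_ecl(ecl), 2))
```

With these conventions the ordering Downside > Base > Upside (unit test T11) follows whenever the downside PD path dominates the base path and the base path dominates the upside path.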

FIELD 3: GOLDEN SOLUTION

Executive Summary

The probability-weighted ECL provision for the 20-loan CRE portfolio is $858,024, representing 0.74% of the total outstanding balance of $115,200,000. The provision is heavily concentrated in the Retail segment ($851,410, 99.2% of total) and in Stage 2 loans ($692,337, 80.7% of total). Ten loans have zero ECL due to full collateral coverage after haircuts.

Data Quality Issues Found

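Transition Matrix Row-Sum Error (Critical): The Grade 5 row sums to 1.002 instead of 1.000. Resolution: normalize the row so that it sums to 1.0.
Legacy Rating for Loan L08: The origination rating B+ is mapped to Grade 3 per Note (a). The mapping is approximate.
Macro Narrative Inconsistency: The Downside narrative states a 25% CRE price decline, but the CRE index path bottoms at 82, a maximum decline of 18%.
ESG Scores: Informational only per Note (b). Excluded from the computation.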
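The row-sum check behind the first finding is easy to reproduce. The sketch below flags rows that do not sum to 1.0 and then rescales them proportionally; proportional rescaling is an assumption here, since the text above only says to normalize. Function names are hypothetical.

```python
GRADES = ["Grade 1", "Grade 2", "Grade 3", "Grade 4",
          "Grade 5", "Grade 6", "Grade 7", "Default"]

# Transition matrix exactly as given in the task prompt.
M_RAW = [
    [0.9200, 0.0650, 0.0100, 0.0030, 0.0010, 0.0005, 0.0003, 0.0002],
    [0.0100, 0.9050, 0.0600, 0.0150, 0.0050, 0.0025, 0.0015, 0.0010],
    [0.0020, 0.0250, 0.8800, 0.0550, 0.0200, 0.0100, 0.0050, 0.0030],
    [0.0005, 0.0050, 0.0300, 0.8550, 0.0600, 0.0250, 0.0150, 0.0095],
    [0.0000, 0.0020, 0.0050, 0.0300, 0.8320, 0.0700, 0.0350, 0.0280],
    [0.0000, 0.0005, 0.0020, 0.0050, 0.0300, 0.8000, 0.0800, 0.0825],
    [0.0000, 0.0000, 0.0005, 0.0020, 0.0050, 0.0400, 0.7530, 0.1995],
    [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
]


def check_row_sums(matrix, tol=1e-6):
    """Return {row label: row sum} for rows not summing to 1.0 within tol."""
    return {GRADES[i]: round(sum(row), 6)
            for i, row in enumerate(matrix)
            if abs(sum(row) - 1.0) > tol}


def normalize_rows(matrix):
    """Rescale every row proportionally so it sums to 1.0."""
    return [[p / sum(row) for p in row] for row in matrix]


print(check_row_sums(M_RAW))                  # {'Grade 5': 1.002}
print(check_row_sums(normalize_rows(M_RAW)))  # {}
```

Running the check before any matrix power matters because a row sum above 1.0 compounds under exponentiation and inflates multi-year PDs.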

Stage Classification

Summary: 16 Stage 1, 3 Stage 2 (L07, L10, L19), 1 Stage 3 (L14).

| Loan | Orig Rtg | Curr Rtg | Orig Ann PD | Curr Ann PD | Relative | Absolute | Stage |
|------|----------|----------|-------------|-------------|----------|----------|-------|
| L01 | 2 | 2 | 0.60% | 0.30% | 0.50x | -0.30% | 1 |
| L02 | 3 | 4 | 1.05% | 1.41% | 1.34x | +0.36% | 1 |
| L03 | 1 | 1 | 0.27% | 0.11% | 0.39x | -0.17% | 1 |
| L04 | 4 | 5 | 2.32% | 3.63% | 1.56x | +1.31% | 1 |
| L05 | 2 | 3 | 0.72% | 0.80% | 1.11x | +0.08% | 1 |
| L06 | 3 | 3 | 1.57% | 1.30% | 0.83x | -0.27% | 1 |
| L07 | 4 | 6 | 2.49% | 8.28% | 3.32x | +5.79% | 2 |
| L08 | B+(~3) | 4 | 1.57% | 2.11% | 1.34x | +0.54% | 1 |
| L09 | 2 | 2 | 0.72% | 0.48% | 0.66x | -0.24% | 1 |
| L10 | 3 | 5 | 1.16% | 2.75% | 2.37x | +1.59% | 2 |
| L11 | 1 | 2 | 0.27% | 0.43% | 1.58x | +0.16% | 1 |
| L12 | 5 | 5 | 4.79% | 4.49% | 0.94x | -0.30% | 1 |
| L13 | 1 | 1 | 0.45% | 0.24% | 0.53x | -0.21% | 1 |
| L14 | 6 | 7 | 9.37% | 18.18% | 1.94x | +8.82% | 3 |
| L15 | 3 | 4 | 1.57% | 2.26% | 1.44x | +0.69% | 1 |
| L16 | 2 | 3 | 0.48% | 0.62% | 1.30x | +0.14% | 1 |
| L17 | 4 | 4 | 2.32% | 1.74% | 0.75x | -0.58% | 1 |
| L18 | 2 | 3 | 0.72% | 1.13% | 1.56x | +0.40% | 1 |
| L19 | 5 | 6 | 4.79% | 9.08% | 1.89x | +4.29% | 2 |
| L20 | 2 | 2 | 0.72% | 0.39% | 0.54x | -0.33% | 1 |

ECL by Loan

| Loan | Stage | EAD ($) | LGD | ECL Weighted ($) |
|------|-------|---------|-----|-----------------|
| L01 | 1 | 5,000,000 | 0.000 | 0 |
| L02 | 1 | 3,575,000 | 0.192 | 6,383 |
| L03 | 1 | 9,250,000 | 0.000 | 0 |
| L04 | 1 | 2,325,000 | 0.247 | 15,601 |
| L05 | 1 | 13,500,000 | 0.000 | 0 |
| L06 | 1 | 1,800,000 | 0.000 | 0 |
| L07 | 2 | 7,300,000 | 0.319 | 523,386 |
| L08 | 1 | 4,375,000 | 0.059 | 2,403 |
| L09 | 1 | 8,425,000 | 0.000 | 0 |
| L10 | 2 | 3,500,000 | 0.188 | 40,395 |
| L11 | 1 | 9,900,000 | 0.000 | 0 |
| L12 | 1 | 2,800,000 | 0.226 | 17,071 |
| L13 | 1 | 17,250,000 | 0.000 | 0 |
| L14 | 3 | 1,200,000 | 0.294 | 115,485 |
| L15 | 1 | 5,250,000 | 0.086 | 4,210 |
| L16 | 1 | 6,600,000 | 0.000 | 0 |
| L17 | 1 | 2,950,000 | 0.166 | 4,533 |
| L18 | 1 | 11,625,000 | 0.000 | 0 |
| L19 | 2 | 1,500,000 | 0.261 | 128,556 |
| L20 | 1 | 8,750,000 | 0.000 | 0 |

Summary Tables

By Segment:

| Segment | Loans | Balance ($) | ECL ($) | ECL/Balance |
|---------|-------|------------|---------|-------------|
| Office | 6 | 45,300,000 | 6,614 | 0.01% |
| Retail | 8 | 23,300,000 | 851,410 | 3.65% |
| Industrial | 6 | 46,600,000 | 0 | 0.00% |
| Total | 20 | 115,200,000 | 858,024 | 0.74% |

By Stage:

| Stage | Loans | Balance ($) | ECL ($) | ECL/Balance |
|-------|-------|------------|---------|-------------|
| Stage 1 | 16 | 102,500,000 | 50,202 | 0.05% |
| Stage 2 | 3 | 11,500,000 | 692,337 | 6.02% |
| Stage 3 | 1 | 1,200,000 | 115,485 | 9.62% |
| Total | 20 | 115,200,000 | 858,024 | 0.74% |
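The summary tables can be reconciled against the per-loan ECL figures with a short aggregation. The sketch below copies the ten non-zero rows from the ECL by Loan table; the helper name is hypothetical.

```python
from collections import defaultdict

# (loan, segment, stage, weighted ECL in USD) for the ten loans with non-zero ECL.
LOANS = [
    ("L02", "Retail", 1, 6_383),
    ("L04", "Retail", 1, 15_601),
    ("L07", "Retail", 2, 523_386),
    ("L08", "Office", 1, 2_403),
    ("L10", "Retail", 2, 40_395),
    ("L12", "Retail", 1, 17_071),
    ("L14", "Retail", 3, 115_485),
    ("L15", "Office", 1, 4_210),
    ("L17", "Retail", 1, 4_533),
    ("L19", "Retail", 2, 128_556),
]


def totals(key_index):
    """Sum ECL grouped by the tuple field at key_index (1=segment, 2=stage)."""
    out = defaultdict(int)
    for row in LOANS:
        out[row[key_index]] += row[3]
    return dict(out)


print(totals(1))                      # {'Retail': 851410, 'Office': 6613}
print(totals(2))                      # {1: 50201, 2: 692337, 3: 115485}
print(sum(row[3] for row in LOANS))   # 858023
```

The Office, Stage 1, and portfolio totals each come out $1 below the summary tables (6,614, 50,202, 858,024), which is consistent with the per-loan figures having been rounded to whole dollars before display.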

Sensitivity Analysis

Collateral haircuts are the dominant sensitivity (+/-19.4% vs +/-1.4% for scenario weights).

FIELD 4: RUBRIC (100 Points)

| # | Criterion | Points | Key Failure Modes |
|---|-----------|--------|-------------------|
| R01 | Transition Matrix Validation | 12 | FM-003, FM-004 |
| R02 | Legacy Rating + Macro Issues | 10 | FM-004, FM-002 |
| R03 | SICR Staging Methodology | 15 | FM-001, FM-008 |
| R04 | PD Term Structure (matrix power) | 10 | FM-003, FM-008 |
| R05 | LGD and EAD Computation | 10 | FM-001, FM-005 |
| R06 | ECL Computation + Macro Overlay | 15 | FM-001, FM-003 |
| R07 | Probability Weighting + Aggregation | 8 | FM-001 |
| R08 | Sensitivity Analysis | 8 | FM-002 |
| R09 | Assumptions + Limitations | 7 | FM-002, FM-005 |
| R10 | Code Quality + Reproducibility | 5 | -- |
| | TOTAL | 100 | |

Tiers: Strong 75-100 | Moderate 45-74 | Weak 0-44

Automated Unit Tests (14)

| ID | Test | Expected | Type |
|----|------|----------|------|
| T01 | Grade 5 row sum flagged as != 1.0 | Flagged | HARD |
| T02 | L08 B+ mapped to numeric grade | Grade 3 | HARD |
| T03 | L07 = Stage 2 | Stage 2 | HARD |
| T04 | L04 = Stage 1 (boundary) | Stage 1 | HARD |
| T05 | L14 = Stage 3 | Stage 3 | HARD |
| T06 | 5yr cumPD Grade 4 via matrix power | 4.8-5.2% | SOFT |
| T07 | All Industrial loans LGD = 0 | LGD = 0 | HARD |
| T08 | L07 LGD ~ 0.319 | 0.30-0.34 | SOFT |
| T09 | Portfolio ECL within 10% of $858K | $772K-$944K | SOFT |
| T10 | L07 has largest individual ECL | L07 > all | HARD |
| T11 | Downside ECL > Base > Upside | Ordered | HARD |
| T12 | ESG not used in computation | Excluded | HARD |
| T13 | 25% vs 18% macro inconsistency flagged | Noted | HARD |
| T14 | Floating loans use current rate as EIR | Confirmed | SOFT |
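One way such checks might be automated is sketched below. The interpretation of the Type column is an assumption: HARD is treated as an exact match and SOFT as a tolerance band. The helper names are hypothetical and not taken from rubric.md.

```python
def hard_check(actual, expected):
    """HARD test (assumed semantics): exact match required."""
    return actual == expected


def soft_check(actual, low, high):
    """SOFT test (assumed semantics): pass if low <= actual <= high."""
    return low <= actual <= high


def band(target, rel_tol):
    """Symmetric relative tolerance band around a target value."""
    return target * (1 - rel_tol), target * (1 + rel_tol)


# T09: portfolio ECL within 10% of the golden total of $858,024.
low, high = band(858_024, 0.10)
print(round(low), round(high))             # 772222 943826

# T08: L07 LGD must land in the 0.30-0.34 band.
print(soft_check(0.319, 0.30, 0.34))       # True

# T03: L07 must be classified as Stage 2.
print(hard_check("Stage 2", "Stage 2"))    # True
```

The computed T09 band matches the $772K-$944K range listed in the table after rounding to the nearest thousand.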

FIELD 5: FAILURE ANALYSIS

Predicted Gemini 3.0 Pro Performance: 30-45/100 (Weak tier)

Cross-Model Predictions:

| Model | SICR | Matrix | ESG | L08 | Macro | Total |
|-------|------|--------|-----|-----|-------|-------|
| Gemini 3.0 Pro | Fail | Fail | Fail | Partial | Fail | 30-45 |
| GPT-4o | Fail | Fail | Partial | Partial | Fail | 35-50 |
| Claude Sonnet 4 | Partial | Partial | Pass | Partial | Pass | 45-60 |
| DeepSeek-V3 | Fail | Fail | Fail | Fail | Fail | 25-40 |

Follow-Up Hints (F1-F6): Transition matrix validation, SICR annualization, legacy rating L08, macro inconsistency, ESG/recovery rates, worked example for L07.

Expected Recovery: Initial ~35 → Final ~55-72/100 (recovery rate ~50-67%)

Part 2: Predictive Maintenance ML Pipeline Benchmark

Domain: Machine Learning Engineering
Estimated human time: 4-8 hours
Predicted Gemini 3.0 Pro score: 20-40 / 100

Task Prompt Summary

Build a complete ML pipeline predicting equipment failures 24h in advance from 90 days of sensor data (30 machines, 5 sensors, ~100 failure events). Asymmetric costs: $50K missed failure vs $2K false alarm.
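The 24-hour-ahead target can be illustrated with a small labeling helper. This is a sketch under assumptions: the exact window convention (failure strictly after the observation and up to 24 hours later) is not specified in the summary, labeling is done per machine, and the function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def make_labels(timestamps, failure_times, horizon=timedelta(hours=24)):
    """label[i] = 1 if a failure occurs in (timestamps[i], timestamps[i] + horizon].

    timestamps and failure_times belong to a single machine. The label looks
    forward by design; the features paired with it must only look backward.
    """
    return [int(any(t < f <= t + horizon for f in failure_times))
            for t in timestamps]


# Illustrative data: readings every 12 hours, one failure 30 hours in.
start = datetime(2026, 1, 1)
readings = [start + timedelta(hours=h) for h in (0, 12, 24, 36)]
failures = [start + timedelta(hours=30)]
print(make_labels(readings, failures))  # [0, 1, 1, 0]
```

Rows recorded after the failure get label 0 for that event, so the positive class stays confined to the 24-hour warning window that precedes it.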

Golden Solution Key Results

Temporal split (NOT random): Train Jan 1-Feb 24, Val Feb 25-Mar 14, Test Mar 15-Mar 31
39 features: rolling stats, rate-of-change, cross-sensor, time features (all backward-looking)
GradientBoostingClassifier with 27.4x class weight
Optimal threshold: 0.050 (far below 0.5 due to 25:1 cost asymmetry)
Test F1: 0.244 (this low value is expected; a test F1 above 0.70 indicates temporal leakage)
Test Recall: 0.654, Precision: 0.150, PR-AUC: 0.305
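The threshold and F1 figures above can be related to the cost structure with a short sketch. It assumes a simple grid search over candidate thresholds on validation-set scores; the function names and the toy data are hypothetical, and the golden solution's actual search procedure is not shown in this summary.

```python
COST_FN = 50_000  # missed failure
COST_FP = 2_000   # false alarm


def total_cost(y_true, scores, threshold):
    """Business cost of alarming whenever score >= threshold."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * COST_FN + fp * COST_FP


def best_threshold(y_val, val_scores, candidates):
    """Pick the lowest-cost candidate on the VALIDATION set, never the test set."""
    return min(candidates, key=lambda th: total_cost(y_val, val_scores, th))


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy validation data (illustrative only).
y_val = [0, 0, 0, 1, 1]
val_scores = [0.01, 0.04, 0.30, 0.06, 0.60]
print(best_threshold(y_val, val_scores, [0.05, 0.5]))  # 0.05

# Consistency check on the reported test metrics.
print(round(f1(0.150, 0.654), 3))                      # 0.244
```

For a perfectly calibrated model the break-even probability is COST_FP / (COST_FP + COST_FN) = 2,000 / 52,000, roughly 0.038, which is the same order of magnitude as the reported optimum of 0.050.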

Rubric (100 Points)

| # | Criterion | Points | Primary Trap |
|---|-----------|--------|-------------|
| R04 | TEMPORAL train/test split | 20 | Random split = leakage |
| R06 | Appropriate evaluation metrics | 12 | Using accuracy as primary metric |
| R07 | Threshold optimization on validation set | 12 | Optimizing on test set |
| R05 | Class imbalance handling | 10 | No handling = predict all-negative |
| R03 | Feature engineering quality | 10 | Future data in features |
| R09 | Leakage awareness and prevention | 10 | No discussion of leakage |
| R02 | Correct label construction | 8 | Look-ahead bias |
| R08 | Business cost analysis | 8 | Ignoring cost asymmetry |
| R01 | Data exploration | 5 | -- |
| R10 | Code quality | 5 | -- |

DIAGNOSTIC: If test F1 > 0.70, the model has temporal leakage. Common failure: train_test_split(random_state=42).
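For contrast with a random split, a date-based split along the golden solution's boundaries might look like the sketch below. The year 2026 is an assumption (the summary gives only month and day), and the row layout and function name are hypothetical.

```python
from datetime import date, timedelta

# Boundaries from the golden solution; the year is assumed.
TRAIN_END = date(2026, 2, 24)  # Train: Jan 1 - Feb 24
VAL_END = date(2026, 3, 14)    # Val:   Feb 25 - Mar 14, Test: Mar 15 - Mar 31


def temporal_split(rows):
    """Split rows of (day, features, label) by calendar date, never at random."""
    train, val, test = [], [], []
    for row in rows:
        day = row[0]
        if day <= TRAIN_END:
            train.append(row)
        elif day <= VAL_END:
            val.append(row)
        else:
            test.append(row)
    return train, val, test


# Illustrative data: one placeholder row per day over the 90-day window.
rows = [(date(2026, 1, 1) + timedelta(days=d), None, 0) for d in range(90)]
train, val, test = temporal_split(rows)
print(len(train), len(val), len(test))  # 55 18 17
```

Every training row precedes every validation row, and every validation row precedes every test row, so rolling-window features computed on the test period cannot overlap observations the model was fitted on.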

Cross-Model Predictions

| Model | Split | Metrics | Threshold | Leakage Aware | Total |
|-------|-------|---------|-----------|---------------|-------|
| Gemini 3.0 Pro | Random (0) | Accuracy (0) | Default (0) | No (0) | 20-40 |
| GPT-4o | Random (0) | Mixed (4) | Partial (4) | Partial (3) | 25-45 |
| Claude Sonnet 4 | Temporal (12) | Proper (8) | Val-set (8) | Yes (6) | 45-65 |
| DeepSeek-V3 | Random (0) | Accuracy (0) | Default (0) | No (0) | 15-35 |

Common failure across all models: Random train/test splitting — train_test_split() appears in virtually every ML tutorial in training data.