IFRS 9 ECL Benchmark Submission
Provenance: Ingested from C:\Users\mesha\Downloads\BENCHMARK_COMPLETE.md (~31 KB) and C:\Users\mesha\Downloads\BENCHMARK.md (~21 KB) on 2026-03-28. BENCHMARK_COMPLETE.md contains the full IFRS 9 ECL benchmark submission package (all 5 Mercor fields inline). BENCHMARK.md contains the Predictive Maintenance ML Pipeline benchmark. Both are complete LLM evaluation benchmarks with task prompts, golden solutions, rubrics, automated unit tests, and failure analysis.
Part 1: IFRS 9 ECL Benchmark — Complete Submission Package
Directory Map
All files are in: mercor-llm-failsafe/outputs/benchmark_ifrs9_ecl/
benchmark_ifrs9_ecl/
|
|-- BENCHMARK_COMPLETE.md <-- THIS FILE (index + all 5 Mercor fields inline)
|
|-- task_prompt.md <-- Field 1: Copy-paste this into Gemini
|-- golden_solution_narrative.md <-- Field 3: Step-by-step correct answer
|-- golden_solution.py <-- Runnable code that produces exact numbers
|-- golden_results.json <-- Exact numerical outputs (auto-generated)
|-- rubric.md <-- Field 4: 100-point scoring rubric + 14 unit tests
|-- follow_ups.md <-- 6 progressive hints for recovery testing
|-- failure_analysis.md <-- Field 5: Why Gemini fails + cross-model comparison
|-- mercor_export.md <-- Summary/overview of all 5 fields
Workflow
- Open task_prompt.md, copy its full content, and paste it into Gemini 3.0 Pro
- Save Gemini's response
- Score against rubric.md (14 unit tests, 100-point scale)
- Send hints from follow_ups.md (F1 through F6) and record recovery
- Get a shareable Gemini conversation link and paste it into Field 2
- Write the final failure_analysis.md with actual (not just predicted) results
FIELD 1: TASK PROMPT
IFRS 9 Expected Credit Loss -- Commercial Real Estate Portfolio
Memorandum
TO: Credit Risk Analytics Team
FROM: Sarah Chen, Chief Risk Officer
RE: Q1 2026 IFRS 9 Expected Credit Loss Computation -- CRE Portfolio
DATE: March 31, 2026
CLASSIFICATION: Internal -- Regulatory Reporting
1. Background and Objectives
The bank is required to compute Expected Credit Loss (ECL) provisions under IFRS 9 for its commercial real estate (CRE) loan portfolio as of the Q1 2026 reporting date (March 31, 2026). The ECL computation must comply with IFRS 9.5.5 requirements, including forward-looking macroeconomic scenarios and staging based on significant increases in credit risk (SICR).
This portfolio consists of 20 CRE loans across three property segments (Office, Retail, Industrial). You are to compute the full ECL provision using the bank's approved methodology described in Section 5.
This analysis feeds directly into the bank's quarterly regulatory filing. Accuracy and auditability are paramount.
2. Portfolio Data
Reporting date: March 31, 2026. All balances in USD.
| Loan ID | Balance ($) | Origination Date | Maturity Date | Orig Rating | Current Rating | Segment | LTV (%) | Interest Rate (%) | Rate Type | Undrawn Commitment ($) | Collateral Value ($) | ESG Risk Score |
|---------|------------|------------------|---------------|-------------|----------------|---------|---------|-------------------|-----------|----------------------|---------------------|----------------|
| L01 | 5,000,000 | 2021-03-15 | 2031-03-15 | 2 | 2 | Office | 65.0 | 4.50 | Fixed | 0 | 7,692,308 | 72 |
| L02 | 3,200,000 | 2022-06-01 | 2029-06-01 | 3 | 4 | Retail | 72.0 | 5.20 | Fixed | 500,000 | 4,444,444 | 65 |
| L03 | 8,500,000 | 2020-01-10 | 2032-01-10 | 1 | 1 | Industrial | 55.0 | 3.80 | Fixed | 1,000,000 | 15,454,545 | 88 |
| L04 | 2,100,000 | 2023-09-01 | 2030-09-01 | 4 | 5 | Retail | 78.0 | 6.10 | Floating | 300,000 | 2,692,308 | 58 |
| L05 | 12,000,000 | 2019-04-20 | 2031-04-20 | 2 | 3 | Office | 60.0 | 4.20 | Fixed | 2,000,000 | 20,000,000 | 75 |
| L06 | 1,800,000 | 2023-01-15 | 2035-01-15 | 3 | 3 | Industrial | 50.0 | 5.00 | Fixed | 0 | 3,600,000 | 82 |
| L07 | 6,500,000 | 2021-07-01 | 2029-07-01 | 4 | 6 | Retail | 85.0 | 6.50 | Floating | 800,000 | 7,647,059 | 45 |
| L08 | 4,000,000 | 2018-11-01 | 2030-11-01 | B+ | 4 | Office | 68.0 | 4.80 | Fixed | 500,000 | 5,882,353 | 70 |
| L09 | 7,300,000 | 2022-03-01 | 2034-03-01 | 2 | 2 | Industrial | 45.0 | 4.00 | Fixed | 1,500,000 | 16,222,222 | 90 |
| L10 | 3,500,000 | 2020-08-15 | 2028-08-15 | 3 | 5 | Retail | 80.0 | 5.80 | Floating | 0 | 4,375,000 | 52 |
| L11 | 9,000,000 | 2021-01-01 | 2033-01-01 | 1 | 2 | Office | 58.0 | 4.10 | Fixed | 1,200,000 | 15,517,241 | 85 |
| L12 | 2,500,000 | 2024-01-15 | 2031-01-15 | 5 | 5 | Retail | 75.0 | 6.80 | Floating | 400,000 | 3,333,333 | 48 |
| L13 | 15,000,000 | 2020-06-01 | 2037-06-01 | 1 | 1 | Industrial | 40.0 | 3.50 | Fixed | 3,000,000 | 37,500,000 | 92 |
| L14 | 1,200,000 | 2023-06-01 | 2028-06-01 | 6 | 7 | Retail | 92.0 | 8.50 | Fixed | 0 | 1,304,348 | 35 |
| L15 | 4,800,000 | 2019-12-01 | 2031-12-01 | 3 | 4 | Office | 70.0 | 4.60 | Fixed | 600,000 | 6,857,143 | 68 |
| L16 | 6,000,000 | 2022-09-01 | 2030-09-01 | 2 | 3 | Industrial | 52.0 | 4.30 | Fixed | 800,000 | 11,538,462 | 80 |
| L17 | 2,800,000 | 2023-03-15 | 2030-03-15 | 4 | 4 | Retail | 74.0 | 5.90 | Fixed | 200,000 | 3,783,784 | 62 |
| L18 | 10,500,000 | 2020-10-01 | 2032-10-01 | 2 | 3 | Office | 62.0 | 4.40 | Fixed | 1,500,000 | 16,935,484 | 73 |
| L19 | 1,500,000 | 2024-06-01 | 2031-06-01 | 5 | 6 | Retail | 88.0 | 7.20 | Floating | 0 | 1,704,545 | 40 |
| L20 | 8,000,000 | 2021-09-01 | 2033-09-01 | 2 | 2 | Industrial | 48.0 | 4.10 | Fixed | 1,000,000 | 16,666,667 | 86 |
Total portfolio balance: $115,200,000 across 20 loans.
3. Internal Rating Transition Matrix
7-grade scale (Grade 1 = strongest, Grade 7 = weakest) plus Default (D) absorbing state. Calibrated to 2010-2025 default experience.
| From \ To | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Default |
|-----------|---------|---------|---------|---------|---------|---------|---------|---------|
| Grade 1 | 0.9200 | 0.0650 | 0.0100 | 0.0030 | 0.0010 | 0.0005 | 0.0003 | 0.0002 |
| Grade 2 | 0.0100 | 0.9050 | 0.0600 | 0.0150 | 0.0050 | 0.0025 | 0.0015 | 0.0010 |
| Grade 3 | 0.0020 | 0.0250 | 0.8800 | 0.0550 | 0.0200 | 0.0100 | 0.0050 | 0.0030 |
| Grade 4 | 0.0005 | 0.0050 | 0.0300 | 0.8550 | 0.0600 | 0.0250 | 0.0150 | 0.0095 |
| Grade 5 | 0.0000 | 0.0020 | 0.0050 | 0.0300 | 0.8320 | 0.0700 | 0.0350 | 0.0280 |
| Grade 6 | 0.0000 | 0.0005 | 0.0020 | 0.0050 | 0.0300 | 0.8000 | 0.0800 | 0.0825 |
| Grade 7 | 0.0000 | 0.0000 | 0.0005 | 0.0020 | 0.0050 | 0.0400 | 0.7530 | 0.1995 |
| Default | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
4. Macroeconomic Scenarios
Three scenarios for probability-weighted ECL. All projections annual from 2026 through 2030. CRE Price Index rebased to 100.0 at reporting date.
Base Case (50%): Continued moderate expansion. GDP 2.0-2.3%, Unemployment 4.0-4.3%, CRE Index 100→110.
Downside (30%): Severe recession, CRE correction. GDP -3.2% to +1.5%, Unemployment 5.5-7.8%, CRE Index 100→82→86.
Upside (20%): Strong growth, tech/infrastructure expansion. GDP 2.5-3.5%, Unemployment 3.3-3.8%, CRE Index 100→123.
5. Approved ECL Methodology
Stage Classification (SICR):
- Stage 1: No SICR → 12-month ECL
- Stage 2: SICR triggered → lifetime ECL
- Stage 3: Credit-impaired (Grade 7 or Default) → lifetime ECL
SICR triggers: annualized lifetime PD increased >100% relative to origination OR absolute increase >150bps.
PD Term Structure: Matrix exponentiation M^t for cumulative PD. Macro overlay: PD_adjusted = PD_base * exp(beta_GDP * delta_GDP + beta_Unemp * delta_Unemp) with beta_GDP = -0.025, beta_Unemp = 0.018.
LGD: Collateral-based: LGD = max(0, 1 - collateral_adjusted / EAD). Haircuts: Office 30%, Retail 35%, Industrial 25%.
EAD: EAD = Balance + CCF * Undrawn. Stage 1 CCF = 75%, Stage 2/3 CCF = 100%.
ECL: ECL(t) = marginal_PD_adjusted(t) * LGD * EAD * DF(t). Weighted: 50% Base + 30% Downside + 20% Upside.
FIELD 3: GOLDEN SOLUTION
Executive Summary
The probability-weighted ECL provision for the 20-loan CRE portfolio is $858,024, representing 0.74% of the total outstanding balance of $115,200,000. The provision is heavily concentrated in the Retail segment ($851,410, 99.2% of total) and in Stage 2 loans ($692,337, 80.7% of total). Ten loans have zero ECL due to full collateral coverage after haircuts.
Data Quality Issues Found
- Transition Matrix Row-Sum Error (Critical): Grade 5 row sums to 1.002. Resolution: Normalize.
- Legacy Rating for Loan L08: B+ mapped to Grade 3 per Note (a). Approximate.
- Macro Narrative Inconsistency: The Downside narrative states a 25% CRE price decline, but the CRE index path (100→82) implies a maximum decline of 18%.
- ESG Scores: Informational only per Note (b). Excluded.
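The row-sum issue can be checked mechanically, and the normalized matrix then feeds the M^t cumulative PD step of the methodology. The following is a minimal sketch in plain Python, not the contents of golden_solution.py; the function names (`flag_bad_rows`, `normalize_rows`, `cumulative_pd`, `marginal_pds`) and the `1e-6` tolerance are assumptions made for illustration.

```python
LABELS = [f"Grade {g}" for g in range(1, 8)] + ["Default"]

# One-year transition matrix exactly as printed in Section 3 of the memo.
M = [
    [0.9200, 0.0650, 0.0100, 0.0030, 0.0010, 0.0005, 0.0003, 0.0002],
    [0.0100, 0.9050, 0.0600, 0.0150, 0.0050, 0.0025, 0.0015, 0.0010],
    [0.0020, 0.0250, 0.8800, 0.0550, 0.0200, 0.0100, 0.0050, 0.0030],
    [0.0005, 0.0050, 0.0300, 0.8550, 0.0600, 0.0250, 0.0150, 0.0095],
    [0.0000, 0.0020, 0.0050, 0.0300, 0.8320, 0.0700, 0.0350, 0.0280],
    [0.0000, 0.0005, 0.0020, 0.0050, 0.0300, 0.8000, 0.0800, 0.0825],
    [0.0000, 0.0000, 0.0005, 0.0020, 0.0050, 0.0400, 0.7530, 0.1995],
    [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
]


def flag_bad_rows(matrix, tol=1e-6):
    """Return the labels of rows whose probabilities do not sum to 1."""
    return [LABELS[i] for i, row in enumerate(matrix) if abs(sum(row) - 1.0) > tol]


def normalize_rows(matrix):
    """Rescale every row so that it sums to 1 (the stated resolution)."""
    return [[p / sum(row) for p in row] for row in matrix]


def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cumulative_pd(matrix, grade, years):
    """Cumulative PD after `years`: the Default column of matrix**years."""
    n = len(matrix)
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(years):
        power = matmul(power, matrix)
    return power[grade - 1][-1]


def marginal_pds(matrix, grade, horizon):
    """Per-year marginal PDs as differences of consecutive cumulative PDs."""
    cum = [cumulative_pd(matrix, grade, t) for t in range(horizon + 1)]
    return [cum[t] - cum[t - 1] for t in range(1, horizon + 1)]


P = normalize_rows(M)
print(flag_bad_rows(M))        # ['Grade 5']
print(cumulative_pd(P, 4, 5))  # 5-year cumulative PD for a Grade 4 loan
```

Normalizing every row rather than only Grade 5 keeps the code uniform; the other rows already sum to 1, so they change only at floating-point precision.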
Stage Classification
Summary: 16 Stage 1, 3 Stage 2 (L07, L10, L19), 1 Stage 3 (L14).
| Loan | Orig Rtg | Curr Rtg | Orig Ann PD | Curr Ann PD | Relative | Absolute | Stage |
|------|----------|----------|-------------|-------------|----------|----------|-------|
| L01 | 2 | 2 | 0.60% | 0.30% | 0.50x | -0.30% | 1 |
| L02 | 3 | 4 | 1.05% | 1.41% | 1.34x | +0.36% | 1 |
| L03 | 1 | 1 | 0.27% | 0.11% | 0.39x | -0.17% | 1 |
| L04 | 4 | 5 | 2.32% | 3.63% | 1.56x | +1.31% | 1 |
| L05 | 2 | 3 | 0.72% | 0.80% | 1.11x | +0.08% | 1 |
| L06 | 3 | 3 | 1.57% | 1.30% | 0.83x | -0.27% | 1 |
| L07 | 4 | 6 | 2.49% | 8.28% | 3.32x | +5.79% | 2 |
| L08 | B+(~3) | 4 | 1.57% | 2.11% | 1.34x | +0.54% | 1 |
| L09 | 2 | 2 | 0.72% | 0.48% | 0.66x | -0.24% | 1 |
| L10 | 3 | 5 | 1.16% | 2.75% | 2.37x | +1.59% | 2 |
| L11 | 1 | 2 | 0.27% | 0.43% | 1.58x | +0.16% | 1 |
| L12 | 5 | 5 | 4.79% | 4.49% | 0.94x | -0.30% | 1 |
| L13 | 1 | 1 | 0.45% | 0.24% | 0.53x | -0.21% | 1 |
| L14 | 6 | 7 | 9.37% | 18.18% | 1.94x | +8.82% | 3 |
| L15 | 3 | 4 | 1.57% | 2.26% | 1.44x | +0.69% | 1 |
| L16 | 2 | 3 | 0.48% | 0.62% | 1.30x | +0.14% | 1 |
| L17 | 4 | 4 | 2.32% | 1.74% | 0.75x | -0.58% | 1 |
| L18 | 2 | 3 | 0.72% | 1.13% | 1.56x | +0.40% | 1 |
| L19 | 5 | 6 | 4.79% | 9.08% | 1.89x | +4.29% | 2 |
| L20 | 2 | 2 | 0.72% | 0.39% | 0.54x | -0.33% | 1 |
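The staging rule behind this table can be sketched as a small function. This is an illustrative reading of the Section 5 triggers, not the golden solution's code: it assumes the annualized PDs are already computed, treats "increased >100%" as a current-to-origination ratio above 2.0x, and uses a hypothetical function name `classify_stage`.

```python
def classify_stage(orig_annual_pd, curr_annual_pd, current_grade):
    """Assign an IFRS 9 stage from annualized lifetime PDs and the current grade.

    PDs are decimals (0.0232 means 2.32%). `current_grade` is 1-7 or "D".
    """
    # Stage 3: credit-impaired (Grade 7 or Default), regardless of PD movement.
    if current_grade == 7 or current_grade == "D":
        return 3

    relative_increase = curr_annual_pd / orig_annual_pd - 1.0  # 1.0 == +100%
    absolute_increase = curr_annual_pd - orig_annual_pd        # 0.015 == 150 bps

    # Stage 2: either SICR trigger is breached.
    if relative_increase > 1.0 or absolute_increase > 0.015:
        return 2
    return 1


# Inputs below are the rounded PDs from the table above.
print(classify_stage(0.0232, 0.0363, 5))  # L04: 1 (1.56x, +131 bps, boundary case)
print(classify_stage(0.0479, 0.0908, 6))  # L19: 2 (absolute trigger only)
print(classify_stage(0.0937, 0.1818, 7))  # L14: 3 (Grade 7)
```

L19 shows why both triggers matter: its ratio stays below 2.0x, so only the 150 bps absolute test moves it to Stage 2.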
ECL by Loan
| Loan | Stage | EAD ($) | LGD | ECL Weighted ($) |
|------|-------|---------|-----|-----------------|
| L01 | 1 | 5,000,000 | 0.000 | 0 |
| L02 | 1 | 3,575,000 | 0.192 | 6,383 |
| L03 | 1 | 9,250,000 | 0.000 | 0 |
| L04 | 1 | 2,325,000 | 0.247 | 15,601 |
| L05 | 1 | 13,500,000 | 0.000 | 0 |
| L06 | 1 | 1,800,000 | 0.000 | 0 |
| L07 | 2 | 7,300,000 | 0.319 | 523,386 |
| L08 | 1 | 4,375,000 | 0.059 | 2,403 |
| L09 | 1 | 8,425,000 | 0.000 | 0 |
| L10 | 2 | 3,500,000 | 0.188 | 40,395 |
| L11 | 1 | 9,900,000 | 0.000 | 0 |
| L12 | 1 | 2,800,000 | 0.226 | 17,071 |
| L13 | 1 | 17,250,000 | 0.000 | 0 |
| L14 | 3 | 1,200,000 | 0.294 | 115,485 |
| L15 | 1 | 5,250,000 | 0.086 | 4,210 |
| L16 | 1 | 6,600,000 | 0.000 | 0 |
| L17 | 1 | 2,950,000 | 0.166 | 4,533 |
| L18 | 1 | 11,625,000 | 0.000 | 0 |
| L19 | 2 | 1,500,000 | 0.261 | 128,556 |
| L20 | 1 | 8,750,000 | 0.000 | 0 |
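The EAD and LGD columns follow directly from the Section 5 formulas, and the sketch below reproduces them for L07 and L02. It is a hedged illustration rather than the golden solution's code. Two points are assumptions: the haircut is applied as collateral × (1 − haircut), which matches the LGD values in the table, and the discount factor is taken as DF(t) = 1 / (1 + EIR)^t with annual periods, since the methodology summary does not spell out DF(t). The units of the macro deltas (percentage points versus the scenario baseline) are also assumed, and all function names are hypothetical.

```python
import math

HAIRCUT = {"Office": 0.30, "Retail": 0.35, "Industrial": 0.25}
WEIGHTS = {"base": 0.50, "downside": 0.30, "upside": 0.20}


def ead(balance, undrawn, stage):
    """EAD = Balance + CCF * Undrawn, with CCF 75% in Stage 1 and 100% otherwise."""
    ccf = 0.75 if stage == 1 else 1.00
    return balance + ccf * undrawn


def lgd(collateral_value, segment, exposure):
    """LGD = max(0, 1 - haircut-adjusted collateral / EAD)."""
    collateral_adjusted = collateral_value * (1.0 - HAIRCUT[segment])
    return max(0.0, 1.0 - collateral_adjusted / exposure)


def macro_adjust(pd_base, delta_gdp, delta_unemp,
                 beta_gdp=-0.025, beta_unemp=0.018):
    """Macro overlay on a PD; the deltas are assumed to be in percentage points."""
    return pd_base * math.exp(beta_gdp * delta_gdp + beta_unemp * delta_unemp)


def scenario_ecl(marginal_pds_adjusted, loss_given_default, exposure, eir):
    """Sum of marginal_PD(t) * LGD * EAD * DF(t) over annual periods t = 1..T.

    Pass a single marginal PD for Stage 1 (12-month ECL) and the full
    remaining-life vector for Stage 2/3 (lifetime ECL).
    """
    return sum(pd_t * loss_given_default * exposure / (1.0 + eir) ** t
               for t, pd_t in enumerate(marginal_pds_adjusted, start=1))


def weighted_ecl(ecl_by_scenario):
    """Probability-weighted ECL: 50% Base + 30% Downside + 20% Upside."""
    return sum(WEIGHTS[name] * ecl_by_scenario[name] for name in WEIGHTS)


# L07: Stage 2 Retail loan, so the full undrawn commitment is included.
l07_ead = ead(6_500_000, 800_000, stage=2)
l07_lgd = lgd(7_647_059, "Retail", l07_ead)
print(l07_ead, round(l07_lgd, 3))  # 7300000.0 0.319
```

Because EAD sits in the denominator of the LGD formula, a loan's stage affects its LGD through the CCF as well as its ECL horizon.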
Summary Tables
By Segment:
| Segment | Loans | Balance ($) | ECL ($) | ECL/Balance |
|---------|-------|------------|---------|-------------|
| Office | 6 | 45,300,000 | 6,614 | 0.01% |
| Retail | 8 | 23,300,000 | 851,410 | 3.65% |
| Industrial | 6 | 46,600,000 | 0 | 0.00% |
| Total | 20 | 115,200,000 | 858,024 | 0.74% |
By Stage:
| Stage | Loans | Balance ($) | ECL ($) | ECL/Balance |
|-------|-------|------------|---------|-------------|
| Stage 1 | 16 | 102,500,000 | 50,202 | 0.05% |
| Stage 2 | 3 | 11,500,000 | 692,337 | 6.02% |
| Stage 3 | 1 | 1,200,000 | 115,485 | 9.62% |
| Total | 20 | 115,200,000 | 858,024 | 0.74% |
Sensitivity Analysis
Collateral haircuts are the dominant sensitivity (+/-19.4% vs +/-1.4% for scenario weights).
FIELD 4: RUBRIC (100 Points)
| # | Criterion | Points | Key Failure Modes |
|---|-----------|--------|-------------------|
| R01 | Transition Matrix Validation | 12 | FM-003, FM-004 |
| R02 | Legacy Rating + Macro Issues | 10 | FM-004, FM-002 |
| R03 | SICR Staging Methodology | 15 | FM-001, FM-008 |
| R04 | PD Term Structure (matrix power) | 10 | FM-003, FM-008 |
| R05 | LGD and EAD Computation | 10 | FM-001, FM-005 |
| R06 | ECL Computation + Macro Overlay | 15 | FM-001, FM-003 |
| R07 | Probability Weighting + Aggregation | 8 | FM-001 |
| R08 | Sensitivity Analysis | 8 | FM-002 |
| R09 | Assumptions + Limitations | 7 | FM-002, FM-005 |
| R10 | Code Quality + Reproducibility | 5 | -- |
| | TOTAL | 100 | |
Tiers: Strong 75-100 | Moderate 45-74 | Weak 0-44
Automated Unit Tests (14)
| ID | Test | Expected | Type |
|----|------|----------|------|
| T01 | Grade 5 row sum flagged as != 1.0 | Flagged | HARD |
| T02 | L08 B+ mapped to numeric grade | Grade 3 | HARD |
| T03 | L07 = Stage 2 | Stage 2 | HARD |
| T04 | L04 = Stage 1 (boundary) | Stage 1 | HARD |
| T05 | L14 = Stage 3 | Stage 3 | HARD |
| T06 | 5yr cumPD Grade 4 via matrix power | 4.8-5.2% | SOFT |
| T07 | All Industrial loans LGD = 0 | LGD = 0 | HARD |
| T08 | L07 LGD ~ 0.319 | 0.30-0.34 | SOFT |
| T09 | Portfolio ECL within 10% of $858K | $772K-$944K | SOFT |
| T10 | L07 has largest individual ECL | L07 > all | HARD |
| T11 | Downside ECL > Base > Upside | Ordered | HARD |
| T12 | ESG not used in computation | Excluded | HARD |
| T13 | 25% vs 18% macro inconsistency flagged | Noted | HARD |
| T14 | Floating loans use current rate as EIR | Confirmed | SOFT |
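Three of the numeric tests (T09, T10, T11) lend themselves to simple programmatic checks. The sketch below shows one way they might be written; the function names and the shape of the inputs (a total, a loan-to-ECL dict, three scenario totals) are assumptions for illustration, not the format of golden_results.json or of the actual test harness.

```python
GOLDEN_TOTAL_ECL = 858_024  # probability-weighted portfolio ECL from the golden solution


def t09_portfolio_ecl_within_tolerance(total_ecl, tolerance=0.10):
    """T09 (SOFT): submitted portfolio ECL is within 10% of the golden total."""
    return abs(total_ecl - GOLDEN_TOTAL_ECL) <= tolerance * GOLDEN_TOTAL_ECL


def t10_l07_has_largest_ecl(ecl_by_loan):
    """T10 (HARD): L07 carries a strictly larger ECL than every other loan."""
    l07 = ecl_by_loan["L07"]
    return all(l07 > ecl for loan, ecl in ecl_by_loan.items() if loan != "L07")


def t11_scenarios_ordered(downside_ecl, base_ecl, upside_ecl):
    """T11 (HARD): scenario ECLs satisfy Downside > Base > Upside."""
    return downside_ecl > base_ecl > upside_ecl


# Golden per-loan ECLs for the loans with non-zero provisions.
golden_ecl = {
    "L02": 6_383, "L04": 15_601, "L07": 523_386, "L08": 2_403,
    "L10": 40_395, "L12": 17_071, "L14": 115_485, "L15": 4_210,
    "L17": 4_533, "L19": 128_556,
}
print(t09_portfolio_ecl_within_tolerance(sum(golden_ecl.values())))  # True
print(t10_l07_has_largest_ecl(golden_ecl))                           # True
```

The HARD tests are pass/fail on an exact condition, while SOFT tests such as T09 accept a tolerance band around the golden value.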
FIELD 5: FAILURE ANALYSIS
Predicted Gemini 3.0 Pro Performance: 30-45/100 (Weak tier)
Cross-Model Predictions:
| Model | SICR | Matrix | ESG | L08 | Macro | Total |
|-------|------|--------|-----|-----|-------|-------|
| Gemini 3.0 Pro | Fail | Fail | Fail | Partial | Fail | 30-45 |
| GPT-4o | Fail | Fail | Partial | Partial | Fail | 35-50 |
| Claude Sonnet 4 | Partial | Partial | Pass | Partial | Pass | 45-60 |
| DeepSeek-V3 | Fail | Fail | Fail | Fail | Fail | 25-40 |
Follow-Up Hints (F1-F6): Transition matrix validation, SICR annualization, legacy rating L08, macro inconsistency, ESG/recovery rates, worked example for L07.
Expected Recovery: Initial ~35 → Final ~55-72/100 (recovery rate ~50-67%)
Part 2: Predictive Maintenance ML Pipeline Benchmark
Domain: Machine Learning Engineering
Estimated human time: 4-8 hours
Predicted Gemini 3.0 Pro score: 20-40 / 100
Task Prompt Summary
Build a complete ML pipeline predicting equipment failures 24h in advance from 90 days of sensor data (30 machines, 5 sensors, ~100 failure events). Asymmetric costs: $50K missed failure vs $2K false alarm.
Golden Solution Key Results
- Temporal split (NOT random): Train Jan 1-Feb 24, Val Feb 25-Mar 14, Test Mar 15-Mar 31
- 39 features: rolling stats, rate-of-change, cross-sensor, time features (all backward-looking)
- GradientBoostingClassifier with 27.4x class weight
- Optimal threshold: 0.050 (far below 0.5 due to 25:1 cost asymmetry)
- Test F1: 0.244 (correct — if F1 > 0.70, temporal leakage present)
- Test Recall: 0.654, Precision: 0.150, PR-AUC: 0.305
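The low optimal threshold follows from minimizing expected cost rather than maximizing accuracy. The sketch below illustrates the idea on a tiny made-up validation set: it picks the candidate threshold with the lowest total cost under the stated $50K missed-failure and $2K false-alarm costs. The function names, the candidate grid, and the toy scores and labels are all hypothetical; the golden solution's actual search procedure is not shown in this document.

```python
COST_MISSED_FAILURE = 50_000  # false negative
COST_FALSE_ALARM = 2_000      # false positive


def total_cost(scores, labels, threshold):
    """Business cost of alerting whenever score >= threshold."""
    cost = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if label == 1 and not predicted:
            cost += COST_MISSED_FAILURE
        elif label == 0 and predicted:
            cost += COST_FALSE_ALARM
    return cost


def best_threshold(scores, labels, candidates):
    """Return (threshold, cost) with the lowest cost; the earliest candidate wins ties."""
    return min(((t, total_cost(scores, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Toy VALIDATION data (illustrative only): model scores and true labels.
val_scores = [0.01, 0.03, 0.04, 0.06, 0.20, 0.90]
val_labels = [0, 0, 0, 1, 0, 1]

threshold, cost = best_threshold(val_scores, val_labels, [0.01, 0.05, 0.50])
print(threshold, cost)  # 0.05 2000
```

In this toy case the default-style cutoff of 0.50 misses the failure scored 0.06 and costs $50,000, while 0.05 catches both failures at the price of one $2,000 false alarm. The search runs on validation data only, so the test set stays untouched for the final report.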
Rubric (100 Points)
| # | Criterion | Points | Primary Trap |
|---|-----------|--------|-------------|
| R04 | TEMPORAL train/test split | 20 | Random split = leakage |
| R06 | Appropriate evaluation metrics | 12 | Using accuracy as primary metric |
| R07 | Threshold optimization on validation set | 12 | Optimizing on test set |
| R05 | Class imbalance handling | 10 | No handling = predict all-negative |
| R03 | Feature engineering quality | 10 | Future data in features |
| R09 | Leakage awareness and prevention | 10 | No discussion of leakage |
| R02 | Correct label construction | 8 | Look-ahead bias |
| R08 | Business cost analysis | 8 | Ignoring cost asymmetry |
| R01 | Data exploration | 5 | -- |
| R10 | Code quality | 5 | -- |
DIAGNOSTIC: If test F1 > 0.70, the model has temporal leakage. Common failure: train_test_split(random_state=42).
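For contrast with a random split, a date-based split along the golden solution's boundaries might look like the sketch below. The source gives only month and day, so the year is an assumption: 2025 is used here because a 90-day window from Jan 1 to Mar 31 implies a non-leap year. The names `assign_split` and `split_rows` are hypothetical.

```python
from datetime import date

YEAR = 2025  # assumption: the document states only month/day boundaries

DATA_START = date(YEAR, 1, 1)
TRAIN_END = date(YEAR, 2, 24)  # Train: Jan 1 - Feb 24
VAL_END = date(YEAR, 3, 14)    # Val:   Feb 25 - Mar 14
TEST_END = date(YEAR, 3, 31)   # Test:  Mar 15 - Mar 31


def assign_split(day):
    """Map an observation date to 'train', 'val', or 'test' by time alone."""
    if not DATA_START <= day <= TEST_END:
        raise ValueError(f"{day} is outside the 90-day observation window")
    if day <= TRAIN_END:
        return "train"
    if day <= VAL_END:
        return "val"
    return "test"


def split_rows(rows):
    """Partition (date, payload) rows into the three chronological sets."""
    parts = {"train": [], "val": [], "test": []}
    for day, payload in rows:
        parts[assign_split(day)].append((day, payload))
    return parts


rows = [(date(YEAR, 2, 24), "a"), (date(YEAR, 2, 25), "b"), (date(YEAR, 3, 15), "c")]
print({name: len(part) for name, part in split_rows(rows).items()})
# {'train': 1, 'val': 1, 'test': 1}
```

Every training row precedes every validation row, and every validation row precedes every test row, so rolling features computed from past readings cannot carry information from the evaluation period into training.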
Cross-Model Predictions
| Model | Split | Metrics | Threshold | Leakage Aware | Total |
|-------|-------|---------|-----------|---------------|-------|
| Gemini 3.0 Pro | Random (0) | Accuracy (0) | Default (0) | No (0) | 20-40 |
| GPT-4o | Random (0) | Mixed (4) | Partial (4) | Partial (3) | 25-45 |
| Claude Sonnet 4 | Temporal (12) | Proper (8) | Val-set (8) | Yes (6) | 45-65 |
| DeepSeek-V3 | Random (0) | Accuracy (0) | Default (0) | No (0) | 15-35 |
Common failure across all models: Random train/test splitting — train_test_split() appears in virtually every ML tutorial in training data.