Failsafe Benchmark Tool Status
Source: failsafe-benchmark-tool-status.md (ingested 2026-03-28)
Here's what's built and working.

failsafe — 4,293 lines of Python across 8 modules, fully functional CLI.

What it does right now:

- `failsafe generate --domain quant_stats --subdomain competing_risks --learn` runs the full 6-phase pipeline: selects failure modes, generates a 200-turbine synthetic dataset with known ground truth, constructs the complete task prompt (written as an internal memo, not an exam question), builds a structured golden-solution skeleton, produces a 100-point rubric with 10 criteria, generates 12 automated tests, and creates 6 follow-up prompts for interactive evaluation. With `--learn`, every phase gets annotated with educational context explaining the why behind each design choice.
- `failsafe failures --domain all -v` dumps the full failure mode registry: 11 documented modes with mechanisms, scale-resistance arguments, trigger features, and learning annotations.
- `failsafe export --dir outputs/benchmark_xxx --format mercor` packages everything into the 5-field Mercor format.
- `failsafe run --task task.json --model gemini-2.5-pro` calls the model API and captures the response (requires API keys in env vars).
- `failsafe learn --domain quant_stats --topic "competing risks" --quiz` runs interactive drills on failure mechanisms.

What's concrete vs. stub:

- Fully implemented: failure mode registry (8 core + 3 domain-specific), dataset generator (200 turbines, 5 CSVs, embedded inconsistency), task prompt, rubric, automated tests, follow-ups, learning annotations, quiz engine, API integration (Gemini/OpenAI/Anthropic), Mercor export.
- Stubs: the golden solution has structured steps but not the complete runnable analysis code (that's the ~4hr expert work); evaluation scoring (the test-execution framework is there but needs the response parser); physics and ML/AI domain packs.

Next steps in order of priority:
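As a rough illustration only (not the tool's actual code), the 6-phase generate pipeline described above can be sketched as an ordered list of phase functions that thread a shared artifact bundle, with each phase attaching a learning annotation when `--learn` is set. All names and payloads here are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    run: Callable[[dict], dict]  # takes and returns the growing artifact bundle
    note: str                    # educational context surfaced with --learn

def run_pipeline(phases: list[Phase], learn: bool = False) -> dict:
    """Run each phase in order, threading one artifact dict through."""
    artifacts: dict = {}
    for phase in phases:
        artifacts = phase.run(artifacts)
        if learn:
            artifacts.setdefault("annotations", []).append(
                f"{phase.name}: {phase.note}"
            )
    return artifacts

# Hypothetical phases mirroring the six described in the text.
PHASES = [
    Phase("select_failure_modes",
          lambda a: {**a, "failure_modes": ["FM-001", "FM-006"]},
          "Pick the modes the task must trigger."),
    Phase("generate_dataset",
          lambda a: {**a, "dataset": "200 turbines, 5 CSVs"},
          "Known ground truth makes grading objective."),
    Phase("build_task_prompt",
          lambda a: {**a, "prompt": "internal memo"},
          "Memo framing avoids exam-question cues."),
    Phase("build_golden_solution",
          lambda a: {**a, "golden": "structured skeleton"},
          "Skeleton first; runnable code is expert work."),
    Phase("build_rubric",
          lambda a: {**a, "rubric": "100 points / 10 criteria"},
          "Criteria map back to the failure modes."),
    Phase("build_tests_and_followups",
          lambda a: {**a, "tests": 12, "followups": 6},
          "Automated tests plus interactive probes."),
]

bundle = run_pipeline(PHASES, learn=True)
```

The point of the shape is that `--learn` is a pure overlay: annotations accumulate alongside the artifacts without changing what any phase produces.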
1. Flesh out the golden solution with actual runnable analysis code against the generated data.
2. Build the response evaluator that parses model output and runs the 12 automated tests.
3. Add a physics domain pack (your Weyl semimetal / flat-band benchmarks map directly to FM-006 and FM-001).
4. Add an ML/AI domain pack (training pipeline debugging, distribution shift diagnosis).
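For step 2, a minimal sketch of what the missing response evaluator could look like, assuming model responses arrive as text containing fenced code blocks and each automated test is a named predicate over the parsed response. Every name below is hypothetical, not the tool's API:

```python
import re

def extract_code_blocks(response: str) -> list[str]:
    """Pull fenced code blocks out of a model response."""
    return re.findall(r"`{3}(?:\w+)?\n(.*?)`{3}", response, flags=re.DOTALL)

def evaluate(response: str, tests: list) -> dict:
    """Run each automated test against the parsed response.

    Each test is a (name, predicate) pair; a predicate takes the raw
    response text plus its extracted code blocks and returns True/False.
    """
    blocks = extract_code_blocks(response)
    results = {name: bool(pred(response, blocks)) for name, pred in tests}
    return {
        "passed": sum(results.values()),
        "total": len(results),
        "results": results,
    }

# Hypothetical checks in the spirit of the 12 automated tests.
TESTS = [
    ("mentions_competing_risks",
     lambda text, blocks: "competing risks" in text.lower()),
    ("includes_code",
     lambda text, blocks: len(blocks) > 0),
    ("flags_data_inconsistency",
     lambda text, blocks: "inconsisten" in text.lower()),
]

fence = "`" * 3
report = evaluate(
    "The competing risks here mean naive Kaplan-Meier is biased.\n"
    + fence + "python\nprint('cif')\n" + fence,
    TESTS,
)
```

Keeping tests as data (name + predicate) means the same evaluator can run the generated per-task test suites and report a pass count the rubric can weight.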