Frontier LLM Benchmark Design Meta Prompt


Source: frontier-llm-benchmark-design-meta-prompt.md (ingested 2026-03-28)

Improved Prompt: Frontier Model Benchmark Designer — Industrial-Grade Task Architecture


Preamble and Operating Context

You are a benchmark research scientist working in agentic mode. Your deliverable is a complete, publication-ready benchmark task that meets the standards of venues like NeurIPS Datasets & Benchmarks, COLM, or HELM technical reports. The task must expose genuine, persistent capability gaps in frontier language models (GPT-4o, Claude 3.5+ Sonnet, Gemini 1.5 Pro, Llama 3.1 405B, or newer equivalents) while remaining solvable — not trivially, but reliably — by a skilled domain expert working in a single focused session of 4–5 hours.

This is not a trick-question generator or a trivia quiz. You are building a measurement instrument. Every design choice must be justified the way an experimentalist justifies apparatus design: by explaining what it measures, why that measurement matters, and what confounds it controls for.

Key constraint on your own behavior: You will be tempted to produce something that looks impressive but is vague at the critical junctures — the actual data tables, the actual numerical answers, the actual scoring thresholds. Resist this. The value of your output is proportional to its specificity. A benchmark that cannot be run without further interpretation by the designer is not a benchmark; it is a sketch.


PHASE 1 — FAILURE MODE ANALYSIS

1.1 Objective

Identify and rigorously document exactly five persistent, empirically observed failure modes of state-of-the-art large language models in applied professional domains. These are the failure modes your benchmark task will be engineered to trigger.

1.2 Selection Criteria

Every failure mode you select must satisfy all four of the following:

  1. Frontier-relevant: It occurs in the strongest publicly available models, not just weaker or older systems. If it has been fixed in the latest generation, discard it and choose another.
  2. Ecologically valid: It is triggered by task conditions that arise naturally in real professional work — not by adversarial prompt injections, Unicode tricks, or deliberately malformed inputs.
  3. Insidiously plausible: It produces outputs that read as confident, fluent, and superficially reasonable, making the failure dangerous rather than obviously wrong. A failure mode that produces gibberish is less interesting than one that produces a well-structured wrong answer.
  4. Scale-resistant: There is a principled reason to believe it will not be eliminated simply by training a larger model on more data. Articulate that reason explicitly — appeal to architectural limitations, training objective misalignment, or information-theoretic arguments, not just hand-waving.

1.3 Required Documentation per Failure Mode

For each of the five failure modes, produce all of the following:

| Field | Requirement |
|---|---|
| Technical name | A precise, descriptive label (e.g., "Constraint Propagation Failure in Multi-Step Numerical Reasoning," not "Bad at math") |
| Mechanism | A 2–4 sentence explanation of what happens computationally or behaviorally when the failure occurs. What does the model do instead of the correct thing? Where in the reasoning chain does the error enter? |
| Concrete exemplars | Two specific, real-world professional contexts where this failure has been observed or would be expected. Be specific: name the domain, the task type, and the nature of the error. If you can cite a published evaluation, red-teaming report, or documented incident, do so. |
| Persistence explanation | Why this failure mode survives RLHF, chain-of-thought prompting, increased context windows, and tool use. Be mechanistic, not vague. |
| Cross-model prevalence | Rate on this scale: Rare (observed in ≤1 frontier model family) / Occasional (observed in some but not all) / Persistent (observed in most) / Near-universal (observed in all tested frontier models). Justify the rating. |
| Detection difficulty | How hard is it for a non-expert reviewer to notice this failure in model output? Rate: Easy to spot / Requires domain knowledge / Requires careful re-derivation / Likely to go unnoticed without ground truth |

1.4 Suggested Failure Mode Categories

The following categories are starting points. You may refine, combine, split, or replace them, but if you deviate significantly, justify why your alternatives are more productive for benchmark design.

  • Multi-step reasoning with implicit or cascading constraints — where satisfying a later constraint retroactively invalidates an earlier choice, and the model fails to backtrack
  • Assumption tracking and consistency maintenance — where the model makes an assumption in step 2, contradicts it in step 5, and does not notice
  • Numerical precision and unit/scale propagation — where small rounding or unit-conversion errors compound through a calculation chain into materially wrong final answers
  • Reconciliation of conflicting information sources — where two inputs disagree and the model either ignores the conflict, silently picks one, or averages them instead of flagging and reasoning about the discrepancy
  • Underdetermination recognition — where the problem as stated does not have a unique answer without additional assumptions, and the model confidently produces a specific answer without acknowledging the degrees of freedom
  • Framing and anchoring effects on quantitative judgment — where irrelevant numerical information in the prompt biases the model's estimates
  • Premature commitment in planning — where the model commits to a solution approach in its first paragraph and then fails to abandon it even when subsequent evidence makes it suboptimal
  • Memorized-knowledge override of provided context — where the model's parametric knowledge conflicts with data given in the prompt, and the model defaults to what it "knows" rather than what it was told
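
As a toy illustration (not part of the required deliverable) of the unit/scale propagation category above, the following sketch shows how rounding every intermediate of a calculation chain to 3 significant figures compounds into a materially different final figure. All numerical values are invented.

```python
# Demonstrate compounding rounding error across a calculation chain.
from math import floor, log10

def round_sig(x: float, sig: int = 3) -> float:
    """Round x to `sig` significant figures."""
    return round(x, -int(floor(log10(abs(x)))) + (sig - 1))

principal = 1_234_567.0   # hypothetical notional
rate = 0.04837            # hypothetical annual rate, compounded monthly
months = 60

# Full-precision chain
exact = principal * (1 + rate / 12) ** months

# Same chain, but every intermediate rounded as a careless solver might
r = round_sig(rate / 12)
growth = round_sig((1 + r) ** months)
sloppy = round_sig(principal) * growth

rel_error = abs(exact - sloppy) / exact
print(f"exact={exact:,.2f}  sloppy={sloppy:,.2f}  error={rel_error:.2%}")
```

Each individual rounding is harmless in isolation; the chain of them shifts a seven-figure result by thousands of dollars, which is exactly the kind of materially wrong but plausible-looking answer this failure mode produces.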

1.5 Deliverable for Phase 1

A structured analysis of exactly five failure modes, each documented as specified above, presented in a numbered list. At the end of the list, write a brief (3–5 sentence) synthesis explaining which pairs or triples of failure modes are most productively combined in a single benchmark task and why.


PHASE 2 — CANDIDATE TASK PROPOSALS

2.1 Objective

Propose exactly three candidate benchmark tasks, each grounded in a distinct real professional domain. These candidates are competing designs; you will select and fully develop only one in Phase 3.

2.2 Domain Selection

Choose from domains where the following conditions hold simultaneously:

  • Real practitioners perform multi-step quantitative or analytical reasoning under uncertainty
  • Input data in practice is messy, incomplete, or partially contradictory
  • Errors have material consequences (financial, clinical, operational, regulatory)
  • Domain expertise is genuinely needed — not just general intelligence
  • The domain is well-enough understood that a ground-truth solution can be established

Strong candidate domains include (but are not limited to):

  • Quantitative finance: portfolio risk, derivatives pricing under nonstandard conditions, credit modeling
  • Clinical data science or biostatistics: trial design, survival analysis, diagnostic test evaluation with messy registry data
  • ML engineering: debugging a training pipeline, diagnosing distribution shift, capacity planning
  • Operations research: vehicle routing, facility location, scheduling under stochastic demand
  • Environmental or energy policy: emissions accounting, grid capacity planning, regulatory impact assessment
  • Epidemiological modeling: outbreak projection, vaccine allocation, causal inference from observational data
  • Software architecture: system design under ambiguous and conflicting stakeholder requirements
  • Regulatory compliance: navigating overlapping or contradictory regulatory frameworks (e.g., cross-border data privacy, pharmaceutical submissions to multiple agencies)

2.3 Mandatory Criteria for Each Candidate

Evaluate each candidate against every one of the following criteria. For each criterion, write 1–3 sentences explaining how the candidate satisfies or fails to satisfy it. Do not simply assert "yes" — demonstrate it.

| # | Criterion |
|---|---|
| C1 | Requires at least four distinct reasoning steps with logical dependencies between them (i.e., the output of step N is a required input to step N+k, not just four independent sub-questions stapled together) |
| C2 | Contains messy, ambiguous, or incomplete inputs that require the solver to make and explicitly state assumptions before proceeding — and different reasonable assumptions lead to materially different answers |
| C3 | Cannot be solved by pattern matching to common textbook problems, Kaggle competitions, or frequently-seen training examples. Explain what makes this task structurally novel relative to likely training data. |
| C4 | Cannot be solved by formula lookup and substitution. Even if a standard formula is relevant, the task requires adapting it, combining it with other reasoning, or recognizing that the standard formula does not apply under the given conditions. |
| C5 | Takes a skilled human practitioner at least 4 hours to complete correctly. Provide a rough time budget for the major subtasks to justify this estimate. |
| C6 | Triggers at least two (preferably three) of the failure modes identified in Phase 1. Name them and explain the triggering mechanism — which specific feature of the task activates which failure mode? |
| C7 | Produces a verifiable output — there exists a correct answer (or a well-defined set of acceptable answers given stated assumptions) that can be checked by a reviewer without needing to redo the entire task from scratch. |
| C8 | Resists partial credit through confident bluffing. The rubric must be designable such that a fluent but wrong answer scores materially lower than a less polished but correct one. Explain how the task structure enables this. |

2.4 Required Documentation per Candidate

For each of the three candidates, provide:

| Field | Requirement |
|---|---|
| Working title | Descriptive, professional (e.g., "Cross-Currency Swap Valuation Under Conflicting Market Data," not "Hard Finance Problem") |
| Professional context | One paragraph describing the real-world situation this task simulates. Who commissioned this work? What decision depends on the output? What are the stakes? |
| Target failure modes | Which failure modes from Phase 1 does this task target, and what specific features of the task trigger them? |
| Expert profile | Job title, domain, approximate years of experience, and key skills of the practitioner who would realistically perform this task. |
| Reasoning chain outline | A numbered list of the major reasoning steps required, with dependencies noted (e.g., "Step 3 depends on the output of Step 1 and the assumption made in Step 2"). Aim for 6–10 steps. |
| Illusory success prediction | A candid paragraph explaining how a frontier model might produce an answer that looks correct on casual inspection but is wrong. What would the model likely get right? Where would it silently go wrong? What would the surface-level output look like? |
| Data requirements | What data will need to be embedded in the prompt? Rough description of the tables, parameters, or documents the task requires. |

2.5 Comparative Assessment

After documenting all three candidates, write a comparison table and a 4–6 sentence assessment of relative strengths and weaknesses. Identify which candidate best balances:

  • Number and severity of failure modes triggered
  • Ecological validity (does it feel like real work?)
  • Verifiability of the correct answer
  • Feasibility of embedding realistic data in the prompt
  • Discriminative power (will it separate strong from weak responses, or will most responses cluster?)

PHASE 3 — TASK DEVELOPMENT

3.1 Selection and Justification

Select the strongest candidate from Phase 2. State your selection and provide a rigorous justification referencing the criteria from Phase 2.3 and the comparative assessment from Phase 2.5. If the selection involves tradeoffs, acknowledge them.

3.2 Task Prompt

Write the complete, exact prompt that will be delivered to the model under evaluation. This prompt is a standalone document. It must satisfy all of the following requirements:

Content requirements:

  • Self-contained: includes all data, tables, numerical inputs, constraints, context, and background the model needs. The model should not need to search the web, access external documents, or make API calls.
  • Contains at least one embedded data inconsistency — two pieces of provided information that cannot both be true simultaneously. The inconsistency should be subtle enough that a careless reader would not notice, but a careful analyst would flag it.
  • Contains at least one implicit constraint — a requirement that is not stated directly but follows logically from the combination of other stated requirements or from domain knowledge that the specified expert would possess.
  • Contains at least one red herring — a piece of information that is presented with the same formatting and apparent relevance as critical inputs, but is either irrelevant to the correct solution or would lead to an incorrect answer if used naively.
  • Requires the model to make at least two material assumptions — choices where a reasonable expert might go either way, and the correct answer depends on which assumption is made. The task should require these assumptions to be stated explicitly.
  • Contains numerical data at realistic scales and precisions — not round numbers, not toy examples, not data that has been obviously simplified for pedagogical purposes.

Format requirements:

  • Written in realistic professional language — as if it were an internal memo, a consulting brief, a technical specification, or an email from a senior colleague. Not in the style of a textbook exercise or exam question.
  • Does not telegraph difficulty, hint at where the traps are, or use language like "be careful about..." or "note that..." in ways that flag the embedded challenges.
  • Specifies a concrete deliverable format — the model must produce structured output (e.g., a table of results, a set of recommendations with supporting calculations, a decision matrix, a parameter set). Not just a prose essay.
  • Specifies a concrete deadline or context for the deliverable that implies what level of rigor and completeness is expected.

Anti-gaming requirements:

  • The prompt must not be solvable by a model that simply generates the most common or expected answer for tasks in this domain. The correct answer must be specific to the particular data provided.
  • The prompt must not be solvable by a model that ignores the data and reasons purely from domain knowledge. The data must be essential to the solution.
  • The prompt must not be solvable by a model that processes each piece of information independently without integrating them. Cross-referencing between different parts of the prompt must be required.

3.3 Embedded Difficulty Features — Designer's Key

This section is not shown to the model under evaluation. It is the answer key for the benchmark designer and evaluator. Document the following with precise references to the prompt text:

3.3.1 Implicit Constraints

For each implicit constraint:

  • Location: Quote or reference the specific text in the prompt where the constraint is implied
  • Nature: What is the constraint?
  • Correct handling: What should a correct solution do with this constraint?
  • Common incorrect handling: What will a model likely do instead?
  • Impact of getting it wrong: How does mishandling this constraint affect the final answer? Quantify if possible.

3.3.2 Data Inconsistencies

For each inconsistency:

  • Location: Identify the two (or more) conflicting pieces of information
  • Nature of the conflict: What specifically is inconsistent?
  • Correct resolution: What should an expert do? (Flag it, investigate, choose one source with justification, average, etc.)
  • Model's likely behavior: What will a frontier model probably do?
  • Impact: How does incorrect resolution affect downstream results?

3.3.3 Red Herrings

For each red herring:

  • Location: Where is it in the prompt?
  • Why it looks relevant: What feature of its presentation or content makes it attractive to use?
  • Why it is not relevant: What domain knowledge or careful analysis reveals it should be ignored or deprioritized?
  • Model's likely behavior: Will the model use it? How will it use it incorrectly?
  • Impact: What goes wrong if it is used?

3.3.4 Required Assumptions

For each assumption the solver must make:

  • What decision point requires an assumption?
  • What is the most defensible assumption and why?
  • What alternative assumptions are reasonable?
  • What assumption will a model likely make (if any) and will it state the assumption explicitly?
  • How much does the final answer change under different assumptions? Provide the range.

3.3.5 Predicted Confident Errors

Identify at least four specific points in the solution where a frontier model is likely to produce a confident, specific, wrong answer. For each:

  • What step in the reasoning chain?
  • What is the likely wrong answer?
  • What is the correct answer?
  • Why is the model likely to get this wrong? (Map to a specific failure mode from Phase 1.)
  • How confident will the model likely appear?

PHASE 4 — GOLDEN SOLUTION

4.1 Objective

Produce a complete, expert-level reference solution to the task prompt from Phase 3. This solution serves as the ground truth against which model outputs will be evaluated. It must be good enough that a peer reviewer in the relevant domain would accept it as correct and well-reasoned.

4.2 Requirements

The golden solution must:

  • Follow the exact output format specified in the task prompt. If the prompt asks for a table and a recommendation memo, the golden solution must contain a table and a recommendation memo.
  • Explicitly state every assumption made, at the point where it is made, with a brief justification for why that assumption was chosen over alternatives.
  • Show all intermediate reasoning steps. Do not skip any step that you would expect a model to skip. In particular, show:
    • All unit conversions
    • All intermediate numerical results (not just the final answer)
    • All logical inferences that connect one step to the next
    • All places where you cross-referenced two or more pieces of input data
  • Handle every edge case, inconsistency, and ambiguity in the input data. For each one, explain what you noticed, what you considered, and what you decided.
  • Flag dependency on assumptions. Wherever the final answer or an intermediate result would change under a different reasonable assumption, state this explicitly and, where feasible, provide the alternative result.
  • Arrive at a correct, verifiable final answer (or a set of answers conditional on stated assumptions).
  • Include a brief sensitivity analysis or robustness check — at minimum, identify which inputs the final answer is most sensitive to and by how much.
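
The minimum sensitivity check described above can be sketched as a one-at-a-time perturbation: bump each input by a fixed fraction and report the relative change in the final answer. The calculation function and input values below are invented placeholders, not part of any specific task.

```python
# One-at-a-time sensitivity check: which inputs move the answer most?

def final_answer(inputs: dict) -> float:
    """Stand-in for the task's full calculation chain."""
    return inputs["volume"] * inputs["unit_margin"] * (1 - inputs["default_rate"])

def sensitivity(inputs: dict, bump: float = 0.01) -> dict:
    """Relative change in the final answer per +1% bump of each input."""
    base = final_answer(inputs)
    result = {}
    for key in inputs:
        perturbed = dict(inputs, **{key: inputs[key] * (1 + bump)})
        result[key] = (final_answer(perturbed) - base) / base
    return result

inputs = {"volume": 18_250.0, "unit_margin": 231.44, "default_rate": 0.0371}
print(sensitivity(inputs))
```

In the golden solution this table of relative changes is what justifies statements like "the final figure is most sensitive to the margin assumption and nearly insensitive to the default rate."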

4.3 Numerical Standards

  • Show calculations to sufficient precision that rounding is not a source of ambiguity. If the final answer is "$4.2M," show enough intermediate steps that a reviewer can confirm whether the exact value is $4,187,000 or $4,243,000.
  • Where standard formulas are used, state the formula, name it, and cite any conditions or limitations on its applicability.
  • Where an approximation is used, state that it is an approximation, explain why it is acceptable, and bound the approximation error.
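
A minimal illustration of the precision standard above: report the rounded headline figure together with the exact value it summarizes, so a reviewer never has to guess which exact number "$4.2M" stands for. The cash-flow figures here are invented placeholders.

```python
# Report a rounded headline alongside the exact value it summarizes.
cash_flows = [1_402_311.57, 1_395_882.04, 1_388_806.39]  # hypothetical
total_exact = sum(cash_flows)

# Acceptable: headline rounded, exact value shown alongside
print(f"Total: $4.2M (exact: ${total_exact:,.2f})")

# Not acceptable on its own: the reviewer cannot tell whether the exact
# value behind "$4.2M" was $4,187,000 or $4,243,000
print("Total: $4.2M")
```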

4.4 Format

Structure the golden solution with clear section headers corresponding to the major reasoning steps. Use numbered steps within sections. Present numerical results in tables where appropriate. End with a clearly labeled "Final Deliverable" section that contains exactly what the task prompt requested.


PHASE 5 — EVALUATION RUBRIC AND AUTOMATED TESTS

5.1 Rubric Design

5.1.1 Structure

Organize the rubric into sections corresponding to the major reasoning stages of the task (not arbitrary categories like "clarity" or "thoroughness"). Each section should evaluate a coherent chunk of the reasoning chain.

For each rubric section:

| Field | Requirement |
|---|---|
| Section title | Descriptive of the reasoning stage |
| What is being evaluated | Precise statement of the capability or output being assessed |
| Full credit | Specific description of what a response must contain or demonstrate to earn full marks. Include acceptable numerical ranges where applicable. |
| Partial credit tiers | At least two intermediate levels between full credit and zero, with specific criteria for each |
| Zero credit | What earns no points — distinguish between "wrong answer" and "no attempt" if relevant |
| Point value | How many points this section is worth |
| Targeted failure mode(s) | Which failure mode(s) from Phase 1 this criterion is designed to detect |
| Scoring notes | Any special instructions for the evaluator — e.g., how to handle a response that gets the right answer through wrong reasoning, or vice versa |

5.1.2 Mandatory Rubric Features

The rubric must:

  • Total to exactly 100 points
  • Allocate points to reflect genuine difficulty and importance, not output length or verbosity. A critical one-line insight that unlocks the rest of the solution should be worth more than a page of boilerplate.
  • Include at least one criterion (worth ≥5 points) that specifically penalizes confident wrong answers more harshly than acknowledged uncertainty. That is, saying "I am confident the answer is X" when X is wrong should score lower than saying "The answer depends on assumption A; if A, then X; if B, then Y."
  • Include at least one criterion (worth ≥5 points) that rewards correct identification of ambiguity, missing information, or inconsistency in the input data, independent of whether the final answer is correct.
  • Include at least one criterion that evaluates internal consistency — whether the model's intermediate steps, stated assumptions, and final answer are mutually compatible.
  • Be specific enough that two competent independent reviewers would agree within 10 points on any given response. If a criterion requires subjective judgment, provide anchoring examples.
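
The asymmetric-confidence criterion above can be sketched as a scoring function in which a confidently wrong answer scores below a wrong-but-explicitly-conditional one. The point value and function name are illustrative, not a required rubric element.

```python
# Asymmetric scoring: confident-wrong must land below hedged-wrong.

def score_confidence_criterion(answer_correct: bool, hedged: bool,
                               points: int = 8) -> int:
    """Score the confidence-calibration criterion for one response.

    answer_correct: final answer matches ground truth within tolerance
    hedged: response states its dependence on assumptions ("if A then X,
            if B then Y") rather than asserting one unconditional value
    """
    if answer_correct:
        return points          # correct, hedged or not: full credit
    if hedged:
        return points // 2     # wrong, but uncertainty acknowledged
    return 0                   # confidently wrong: harshest outcome

# The ordering the rubric mandates:
assert score_confidence_criterion(False, True) > score_confidence_criterion(False, False)
```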

5.1.3 Point Allocation Guidance

Rough allocation targets (adjust based on task structure, but justify deviations):

| Category | Target Allocation |
|---|---|
| Problem setup and assumption identification | 15–20 points |
| Core analytical reasoning (the hard multi-step part) | 35–45 points |
| Handling of inconsistencies and ambiguities | 10–15 points |
| Numerical accuracy of final results | 15–20 points |
| Output format, communication, and appropriate caveats | 5–10 points |

5.2 Automated Checks (Unit Tests)

Design at least twelve unit tests that can be run programmatically against a model's output. These are meant to supplement human review, not replace it.

For each test:

| Field | Requirement |
|---|---|
| Test ID | e.g., T01, T02, etc. |
| What is checked | Precise description |
| Checking logic | Described in enough detail that a Python programmer could implement it without further clarification. Include regex patterns, numerical tolerances, or keyword lists as appropriate. |
| Expected correct output | What a correct response would contain |
| Failure signature | What an incorrect response looks like for this test |
| Score impact | How many points are gained or lost |
| Test type | Hard check (binary pass/fail) or Soft check (graded, with scoring tiers) |
| Failure mode detected | Which Phase 1 failure mode this test is designed to catch |

Ensure the automated checks cover at least the following categories:

  • Presence and explicitness of required assumptions (at least 2 tests)
  • Numerical accuracy of key intermediate values (at least 3 tests, with specified tolerances)
  • Internal consistency between stated assumptions and computed results (at least 2 tests)
  • Correct identification of data issues (at least 1 test)
  • Output format compliance (at least 1 test)
  • Consistency between conclusion and reasoning (at least 1 test)
  • Absence of red herring incorporation into the final answer (at least 1 test)
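
Two of these checks might be sketched as follows; the regex patterns, ground-truth value, and tolerance are placeholders that a concrete task would replace with its own.

```python
# Sketch of two automated checks: an assumption-explicitness hard check
# and a numerical-tolerance check on a key intermediate value.
import re

GROUND_TRUTH_STEP3 = 4_187_000.0   # hypothetical key intermediate value
TOLERANCE = 0.005                  # 0.5% relative tolerance

def t01_assumptions_stated(response: str) -> bool:
    """T01 (hard check): the response explicitly flags its assumptions."""
    return re.search(r"\bassum(e|es|ed|ing|ption|ptions)\b",
                     response, re.IGNORECASE) is not None

def t02_step3_value(response: str) -> bool:
    """T02: some number in the response falls within TOLERANCE of the
    ground-truth intermediate value for step 3."""
    for m in re.finditer(r"\$?([\d,]+(?:\.\d+)?)", response):
        try:
            value = float(m.group(1).replace(",", ""))
        except ValueError:
            continue
        if abs(value - GROUND_TRUTH_STEP3) / GROUND_TRUTH_STEP3 <= TOLERANCE:
            return True
    return False
```

A real harness would attach each function to its rubric section, point value, and failure-mode tag per the table above; keyword checks like T01 are deliberately permissive, since human review catches assumptions stated in unusual phrasing.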

5.3 Score Interpretation and Expected Performance

Define three performance tiers with non-overlapping score ranges:

| Tier | Score Range | Characteristics |
|---|---|---|
| Strong | Define range | Describe what a response in this tier looks like — which things it gets right, what level of reasoning it demonstrates |
| Moderate | Define range | Describe |
| Weak | Define range | Describe |

Then provide:

  • Predicted frontier model performance: Which tier, what approximate score range, and a specific explanation of which rubric sections the model would likely score well on and which it would likely fail.
  • Predicted expert human performance: Which tier, what approximate score range, and what the main sources of human error would be (if any).
  • Discriminative power assessment: How well does this scoring system separate genuinely good reasoning from superficially plausible output? Where is the rubric most and least effective?

PHASE 6 — FOLLOW-UP INTERACTION PROTOCOL

6.1 Objective

Design a structured set of exactly six follow-up prompts to be used in a second-round interactive evaluation after the model produces its initial response. These prompts probe whether the model can recover from errors, recognize its own mistakes, and update its reasoning coherently.

6.2 Requirements per Follow-Up

For each of the six follow-up prompts:

| Field | Requirement |
|---|---|
| Follow-up ID | F1 through F6 |
| Exact prompt text | Written as it would be delivered to the model, in natural language. Must not reveal the correct answer. |
| Capability probed | What specific ability is this testing? (e.g., error detection, numerical correction, assumption revision, consistency checking) |
| Prerequisite | What error or feature of the initial response triggers this follow-up? (e.g., "Use if the model failed to identify the data inconsistency") |
| Predicted model behavior | What will a frontier model likely do in response? Will it recover, partially recover, or continue to fail? |
| Recovery mechanism | If recovery is possible, explain what cognitive or computational operation the model needs to perform. If recovery is unlikely, explain what prevents it. |
| Diagnostic value | Low (confirms what we already know from the initial response) / Medium (provides additional signal about reasoning depth) / High (distinguishes between superficial and genuine understanding) |
| Scoring | How does the response to this follow-up affect the overall score? (Additive points, no effect, or penalty?) |

6.3 Mandatory Follow-Up Types

The six follow-ups must include at least one of each of the following types:

  1. Indirect error signal: Hints at an error without stating what the error is. Tests whether the model can locate and fix its own mistake given a vague cue. (e.g., "A colleague reviewed your analysis and thinks one of the intermediate values might be off. Can you double-check your work?")

  2. Assumption challenge: Asks the model what would change if a specific assumption were different. Tests whether the model actually tracked its own assumptions or just stated them as decoration. (e.g., "What if we used X instead of Y for that parameter? How would your final recommendation change?")

  3. Numerical correction injection: Provides the correct value for one intermediate step and asks the model to propagate the correction through to the final answer. Tests whether the model can perform targeted revision without starting over. (e.g., "Actually, the correct value for step 3 is Z. Please update your remaining calculations accordingly.")

  4. Consistency probe: Points out a specific inconsistency between two parts of the model's own response and asks it to resolve the conflict. (e.g., "In section 2 you assumed X, but in section 4 you used Y. Which is correct, and what changes?")

  5. Underdetermination probe: Asks about a degree of freedom in the problem that the model may have resolved silently. (e.g., "You gave a single answer, but doesn't the result depend on how we interpret the Q3 revenue figures? What range of answers is possible?")

  6. Metacognitive assessment: Asks the model to rate its own confidence in different parts of its answer and identify where it is most likely to be wrong. (e.g., "Which part of your analysis are you least confident in, and why?")

6.4 Phase 6 Synthesis

Conclude with a structured summary:

  • Which failure modes from Phase 1 proved resilient to follow-up prompting (the model could not recover even with hints)?
  • Which failure modes were partially recoverable through interaction?
  • What does this imply about the difference between capability limitations (the model cannot do this) vs. activation failures (the model can do this but did not)?
  • How should the interactive results be weighted relative to the initial response in an overall evaluation?

PHASE 7 — CAPABILITY STRESS ANALYSIS

7.1 Objective

Write a technical analysis suitable for inclusion in the methods section of a benchmark paper. The audience is ML researchers and evaluation scientists.

7.2 Required Content

Address each of the following in dedicated subsections:

7.2.1 Capability Mapping

For each distinct capability this task requires (enumerate them — aim for 5–8), explain:

  • What the capability is
  • Which specific step(s) of the task require it
  • What is known about current transformer-based models' ability to perform it
  • Whether this capability is tested by any existing major benchmark, and if so, how this task's demands differ

7.2.2 Scale Insufficiency Argument

For each failure mode this task targets, provide a principled argument for why simply training a larger model on more data is unlikely to resolve it. Acceptable argument types include:

  • Information-theoretic arguments (the training data does not contain the relevant signal)
  • Architectural arguments (the failure relates to a structural limitation of autoregressive generation, finite context windows, or single-pass inference)
  • Objective function arguments (the training objective does not incentivize the correct behavior)
  • Distribution arguments (the task requires out-of-distribution generalization that scale does not provide)

Unacceptable: vague assertions like "this requires real understanding" or "scale can't solve everything."

7.2.3 Theoretical Remediation

What changes to model architecture, training methodology, inference procedure, or tool integration would theoretically be needed for a model to consistently solve this task correctly? Be specific. Distinguish between:

  • Changes that would help and are likely feasible in the near term (1–2 years)
  • Changes that would help but face significant research obstacles
  • Changes that would require fundamental advances

7.2.4 Benchmark Differentiation

Explicitly compare this task to the following existing benchmarks and explain what this task measures that each one does not:

  • MATH / GSM8K (mathematical reasoning)
  • HumanEval / MBPP (code generation)
  • MMLU / GPQA (knowledge and reasoning breadth)
  • BIG-Bench Hard (diverse challenging tasks)
  • HELM (holistic evaluation)
  • SWE-bench (real software engineering)
  • Any domain-specific benchmark relevant to your chosen task domain

For each comparison, the claim must be specific: "MATH tests X but not Y; this task requires Y because of Z."

7.2.5 Ceiling Effect and Gaming Analysis

Address these risks explicitly:

  • Ceiling effect via memorization: Could a model score well by having memorized a similar problem from its training data? How does the task design guard against this?
  • Ceiling effect via surface heuristics: Could a model achieve a passing score by producing well-formatted output with reasonable-sounding numbers, without actually performing the required reasoning? How does the rubric guard against this?
  • Floor effect via excessive difficulty: Is there a risk that all models score near zero, providing no useful signal? How does the rubric provide granularity in the low-scoring range?
  • Construct validity: Does the task actually measure what it claims to measure? Could a model fail for reasons unrelated to the targeted failure modes (e.g., context length, output length limits, formatting confusion)?
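One concrete way to operationalize the memorization guard above is parameter perturbation: generate several variants of the task by resampling its numeric inputs and recomputing the golden answer, so a model that memorized the canonical instance fails on the variants. The sketch below is a minimal, hypothetical illustration; the computation in `golden_answer` and all parameter names are stand-ins, not taken from any specific task in this document.

```python
import random

def golden_answer(unit_cost, volume, overhead_rate):
    """Stand-in for the task's true computation (illustrative only)."""
    return unit_cost * volume * (1 + overhead_rate)

def make_variants(seed, n=5):
    """Deterministically generate n perturbed task instances with recomputed answers."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        params = {
            "unit_cost": round(rng.uniform(12.0, 18.0), 2),
            "volume": rng.randint(800, 1200),
            "overhead_rate": round(rng.uniform(0.05, 0.15), 3),
        }
        # The golden value is recomputed per variant, so a memorized answer
        # to the canonical instance will not match.
        variants.append((params, golden_answer(**params)))
    return variants

for params, answer in make_variants(seed=42):
    print(params, round(answer, 2))
```

Because variant generation is seeded, the same perturbed instances can be regenerated by any evaluator, which also supports the reproducibility standard below.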

EXECUTION STANDARDS

These standards apply to your entire output. Treat them as hard constraints, not aspirations.

  1. Specificity over impressiveness. Every claim must be backed by a concrete example, a number, or a precise mechanism. "Models struggle with X" is not acceptable. "Models produce output Y when the correct answer is Z because of mechanism W" is acceptable.

  2. Honest capability assessment. Do not overstate model weaknesses to make your benchmark seem more valuable. If a frontier model would likely handle a particular sub-step correctly, say so and explain why the task is still hard overall.

  3. Professional realism. The task prompt must read like a document from a real workplace. Use realistic terminology, realistic data magnitudes, realistic levels of ambiguity, and realistic stakes. Do not use toy numbers, perfectly round figures, or oversimplified scenarios unless realism demands them.

  4. Justified design. For every embedded difficulty feature (inconsistency, red herring, implicit constraint), explain why you chose that specific technique, why it is likely to expose a genuine model weakness, and why it would not be perceived as unfair by a reasonable evaluator.

  5. Structural difficulty, not surface confusion. The task must be hard because the reasoning is hard, not because the prompt is poorly written, ambiguously formatted, or deliberately confusing. A domain expert should be able to read the prompt and immediately understand what is being asked, even if producing the correct answer requires significant work.

  6. Reproducibility. Your output must be complete enough that another researcher could run this benchmark — administer the prompt, evaluate a response, score it, and interpret the score — without needing to contact you for clarification.

  7. Internal consistency. Your golden solution must actually solve the prompt you wrote. Your rubric must actually evaluate the outputs your prompt requests. Your automated tests must actually check the features your rubric scores. Verify this explicitly.
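The internal-consistency check in standard 7 can itself be partially automated: cross-reference the rubric's automatable criteria against the automated test suite and flag any gaps in either direction. The sketch below assumes a simple dictionary representation of rubric criteria and checks; all IDs, feature names, and check names are hypothetical, not drawn from this document.

```python
# Hypothetical audit: every automatable rubric feature has a check,
# and no check tests a feature the rubric never scores.
rubric_criteria = {
    "R1": {"feature": "final_npv_value", "automatable": True},
    "R2": {"feature": "sensitivity_table_present", "automatable": True},
    "R3": {"feature": "assumption_justification", "automatable": False},
}

automated_checks = {
    "check_npv": "final_npv_value",
    "check_sensitivity": "sensitivity_table_present",
}

auto_features = {c["feature"] for c in rubric_criteria.values() if c["automatable"]}
checked_features = set(automated_checks.values())

missing_checks = auto_features - checked_features   # rubric features with no test
orphan_checks = checked_features - auto_features    # tests tied to nothing in the rubric

assert not missing_checks, f"No automated check for: {missing_checks}"
assert not orphan_checks, f"Checks not scored by rubric: {orphan_checks}"
print("rubric and automated checks are mutually consistent")
```

A one-directional version of the same audit (golden solution sections versus prompt deliverables) catches the remaining consistency requirement in this standard.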


OUTPUT FORMAT AND SEQUENCING

Deliver your response in clearly labeled sections corresponding to Phases 1 through 7. Complete each phase fully before beginning the next. Do not combine phases or provide abbreviated versions with promises to elaborate later.

Required section headers:

## PHASE 1 — FAILURE MODE ANALYSIS
## PHASE 2 — CANDIDATE TASK PROPOSALS
## PHASE 3 — TASK DEVELOPMENT
### 3.2 Task Prompt
### 3.3 Embedded Difficulty Features
## PHASE 4 — GOLDEN SOLUTION
## PHASE 5 — EVALUATION RUBRIC AND AUTOMATED TESTS
### 5.1 Rubric
### 5.2 Automated Checks
### 5.3 Score Interpretation
## PHASE 6 — FOLLOW-UP INTERACTION PROTOCOL
## PHASE 7 — CAPABILITY STRESS ANALYSIS

Length guidance: The complete output should be detailed enough to serve as a standalone benchmark specification document. Phase 3 (task prompt + difficulty features) and Phase 4 (golden solution) will be the longest sections. Phase 5 (rubric + tests) should be exhaustively detailed. Do not sacrifice precision for brevity.

Begin with Phase 1. Proceed sequentially.