Comprehensive Fixture-Based Validation Strategy for STEM Submission Quality
Design Philosophy
The central insight driving this fixture set is that real-world submissions are not cleanly separable into "good" and "bad" categories. They are messy hybrids where multiple issues interact, where correctness in one dimension masks failures in another, and where the line between a blocking error and a tolerable imperfection depends on context. A validator that only handles clean dichotomies will fail in production — not because it gets things wrong, but because authors will learn to ignore it.
The fixture set below is structured around failure patterns, not quality labels. Each fixture encodes a specific way that submissions break in practice, drawn from recurring patterns in STEM content authoring. The goal is to stress-test a validator against the situations that actually matter: rubrics that drift from solutions, formatting that breaks downstream rendering, reasoning that arrives at correct answers through incorrect logic, constraints that contradict each other silently, and metadata that is technically parseable but practically useless.
Two principles govern the assertion strategy. First, assertions should be slightly fuzzy — checking verdict tier and the presence of key diagnostic flags rather than exact-matching against a complete list of failures. Exact-match assertions are brittle and create maintenance burden disproportionate to their diagnostic value. Second, the fixtures are organized so that the validator's behavior on edge cases reveals its calibration: an overly strict validator will flag the golden paths, an overly lenient one will pass the compounding-failure case, and a poorly calibrated one will conflate subjective quality issues with objective errors.
The Fixture Set
Fixture 1: physics-golden-path.json
Domain: Physics
Topic: Projectile Motion
Pattern: Clean submission that should pass without flags. Exists to confirm the validator does not over-flag well-constructed content.
{
"domain": "physics",
"topic": "projectile motion",
"problem_statement": "A ball is launched at 30° above horizontal with initial speed 20 m/s. Find the maximum height reached. Assume g = 9.8 m/s².",
"solution": {
"steps": [
"Identify vertical component: v_y = 20 * sin(30°) = 10 m/s",
"Use v² = v_y² - 2gh, set v = 0 at max height",
"h = v_y² / (2g) = 100 / 19.6 ≈ 5.1 m"
],
"final_answer": "5.1 m"
},
"rubric": [
{ "criterion": "Correct velocity decomposition", "points": 2 },
{ "criterion": "Correct kinematic equation applied", "points": 2 },
{ "criterion": "Correct final answer with units", "points": 1 }
]
}
What makes this a golden path: Every rubric criterion maps directly to a solution step. Units are present throughout. The mathematical chain is internally consistent. LaTeX is not used, so no formatting risk. The problem statement provides all necessary constants.
Expected output: Verdict GO. No significant failure flags. If the validator flags anything here, it is miscalibrated and will erode author trust immediately.
Assertion strategy:

- Assert verdict is `GO`
- Assert the failure flag list is empty or contains only informational-level notes
Fixture 2: physics-partial-credit-mismatch.json
Domain: Physics
Topic: Projectile Motion
Pattern: Rubric criteria do not map to solution methodology. This happens frequently when a problem is revised after the rubric is written, or when a rubric is copied from a similar but non-identical problem.
{
"domain": "physics",
"topic": "projectile motion",
"problem_statement": "A ball is launched at 30° above horizontal with initial speed 20 m/s. Find the total time of flight.",
"solution": {
"steps": [
"Vertical component: v_y = 20 * sin(30°) = 10 m/s",
"Time to apex: t = v_y / g = 10 / 9.8 ≈ 1.02 s",
"Total time: 2t ≈ 2.04 s"
],
"final_answer": "2.04 s"
},
"rubric": [
{ "criterion": "Correct use of energy conservation", "points": 3 },
{ "criterion": "Correct final answer", "points": 2 }
]
}
What is broken: The rubric references energy conservation, but the solution uses kinematics. These are both valid approaches to projectile motion problems, but this rubric was clearly written for a different variant (likely the maximum height version, where energy methods are more natural). A student following the rubric guidance would be confused about what approach is expected. Additionally, the rubric has only 2 criteria for a 3-step solution, making partial credit assignment ambiguous.
Why this matters: Rubric-solution misalignment is one of the most common and most damaging submission errors. It does not produce an obviously wrong answer. It does not violate any schema. It only becomes visible when someone tries to actually use the rubric to grade student work, at which point it causes inconsistent grading and student complaints.
Expected output: Verdict REVIEW. Key flags should include rubric-solution alignment failure and partial credit mapping gap.
Assertion strategy:

- Assert verdict is `REVIEW`
- Assert presence of a rubric alignment flag
- Do not assert an exact flag list — the validator may also flag the point distribution imbalance, which is acceptable but not required
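One lightweight way to approximate a rubric-alignment check is content-word overlap between each criterion and the solution text. This is a hypothetical sketch, not the validator's actual mechanism; the stopword list and function name are illustrative:

```python
import re

# Illustrative stopword list: rubric boilerplate that should not count as content.
STOPWORDS = {"correct", "use", "of", "the", "a", "an", "final", "answer", "with", "applied"}

def criterion_overlap(criterion, steps):
    """Fraction of a criterion's content words that appear in any solution step."""
    words = set(re.findall(r"[a-z]+", criterion.lower())) - STOPWORDS
    if not words:
        return 1.0  # nothing substantive left to match
    step_text = " ".join(steps).lower()
    return sum(1 for w in words if w in step_text) / len(words)

# Fixture 2: a kinematics solution graded by an energy-conservation rubric
steps = [
    "Vertical component: v_y = 20 * sin(30°) = 10 m/s",
    "Time to apex: t = v_y / g = 10 / 9.8 ≈ 1.02 s",
    "Total time: 2t ≈ 2.04 s",
]
# "energy" and "conservation" never appear in the solution, so overlap is zero
print(criterion_overlap("Correct use of energy conservation", steps))  # 0.0
```

A low overlap score is only a trigger for the rubric-alignment flag, not proof of misalignment; synonyms and notation ("v_y" versus "velocity") mean a real implementation needs fuzzier matching.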
Fixture 3: physics-latex-formatting-edge-case.json
Domain: Physics
Topic: Electric Potential
Pattern: Mathematically valid content with malformed LaTeX that breaks rendering. This is historically one of the most common sources of validator bypass — the content looks fine in the authoring tool but renders incorrectly downstream.
{
"domain": "physics",
"topic": "electric potential",
"problem_statement": "Find the electric potential at distance r from a point charge Q. Use $V = \\frac{kQ}{r$.",
"solution": {
"steps": [
"Apply Coulomb potential formula",
"V = \\frac{kQ}{r} where k = 8.99 \\times 10^9 N\\cdot m^2/C^2",
"Substitute values to get numerical result"
],
"final_answer": "V = kQ/r"
},
"rubric": [
{ "criterion": "Correct formula stated", "points": 2 },
{ "criterion": "Correct substitution", "points": 2 },
{ "criterion": "Units included", "points": 1 }
]
}
What is broken: The problem statement contains $V = \\frac{kQ}{r$ — a closing brace is missing inside the LaTeX delimiter, which will cause rendering failure in any standard LaTeX processor. The solution steps mix escaped LaTeX (\\frac, \\times, \\cdot) with unescaped or partially escaped notation inconsistently. Step 3 says "substitute values" but no values are actually substituted, making the rubric criterion for "correct substitution" unverifiable.
Why this matters: LaTeX formatting errors are invisible in plain-text review but catastrophic in rendered output. They are also the single most common class of issue that validators miss because content-level analysis skips formatting validation. A validator that does not catch malformed LaTeX is not protecting the rendering pipeline.
Expected output: Verdict REVIEW. Key flags should include malformed LaTeX in problem statement and incomplete solution step.
Assertion strategy:

- Assert verdict is `REVIEW`
- Assert presence of a LaTeX parsing or formatting flag
- Optionally assert a flag for solution completeness (step 3 is a placeholder, not a real step)
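A full LaTeX parser is not needed to catch Fixture 3's truncation. A brace-depth scan over each inline math span already flags it. This is a minimal sketch, assuming inline math is delimited by single `$` pairs; real LaTeX validation has to handle `$$`, `\(...\)`, and environments as well:

```python
import re

def latex_brace_errors(text):
    """Return inline math spans ($...$) whose braces do not balance."""
    errors = []
    for span in re.findall(r"\$([^$]*)\$", text):
        depth = 0
        for ch in span:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            if depth < 0:  # a close with no matching open
                break
        if depth != 0:
            errors.append(span)
    return errors

# Fixture 3's problem statement, with the missing closing brace
stmt = ("Find the electric potential at distance r from a point charge Q. "
        "Use $V = \\frac{kQ}{r$.")
print(latex_brace_errors(stmt))  # flags the span with the unclosed brace
```

The same scan passes well-formed input such as `$\frac{kQ}{r}$` untouched, which is the property that keeps it from flagging the golden-path fixtures.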
Fixture 4: chemistry-golden-path.json
Domain: Chemistry
Topic: Stoichiometry
Pattern: Clean submission with balanced equation, dimensional analysis, consistent units, and rubric that mirrors solution steps.
{
"domain": "chemistry",
"topic": "stoichiometry",
"problem_statement": "How many grams of CO₂ are produced when 10 g of CH₄ are completely combusted? (CH₄ + 2O₂ → CO₂ + 2H₂O, M(CH₄) = 16 g/mol, M(CO₂) = 44 g/mol)",
"solution": {
"steps": [
"Moles of CH₄ = 10 / 16 = 0.625 mol",
"Moles of CO₂ = 0.625 mol (1:1 ratio from balanced equation)",
"Mass of CO₂ = 0.625 × 44 = 27.5 g"
],
"final_answer": "27.5 g"
},
"rubric": [
{ "criterion": "Correct mole calculation for CH₄", "points": 2 },
{ "criterion": "Correct molar ratio applied from balanced equation", "points": 1 },
{ "criterion": "Correct final mass with units", "points": 2 }
]
}
What makes this a golden path: The balanced equation is provided in the problem statement. Molar masses are given explicitly, removing ambiguity. Each solution step corresponds to exactly one rubric criterion. Units are present at every stage. The final answer includes units.
Expected output: Verdict GO. No significant failure flags.
Assertion strategy:

- Assert verdict is `GO`
- Assert the failure flag list is empty or contains only informational-level notes
Fixture 5: chemistry-units-missing.json
Domain: Chemistry
Topic: Stoichiometry
Pattern: Correct reasoning and correct numerical answer, but units are absent throughout. The rubric does not explicitly require units, creating ambiguity about whether this is a failure.
{
"domain": "chemistry",
"topic": "stoichiometry",
"problem_statement": "How many grams of CO₂ are produced when 10 g of CH₄ are completely combusted?",
"solution": {
"steps": [
"Moles of CH₄ = 10 / 16 = 0.625",
"Moles of CO₂ = 0.625",
"Mass of CO₂ = 0.625 * 44 = 27.5"
],
"final_answer": "27.5"
},
"rubric": [
{ "criterion": "Correct mole calculation", "points": 2 },
{ "criterion": "Correct ratio", "points": 1 },
{ "criterion": "Correct final answer", "points": 2 }
]
}
What is broken: Units are absent in all solution steps and the final answer. The problem statement asks for "grams" but the solution never explicitly uses the unit "g" or "mol" — the numbers are correct but dimensionless. More subtly, the rubric says "correct final answer" but does not specify that units are required. This creates a grading ambiguity: should a student who writes "27.5" without "g" receive full credit?
Why this matters: This is one of the most common real-world submissions. Authors who are fluent in the domain often drop units because they are "obvious." The validator needs to flag this, but the verdict should be REVIEW rather than NO-GO because the content is not wrong — it is incomplete in a way that requires a human judgment call about grading standards.
Expected output: Verdict REVIEW. Key flags should include missing units in solution and rubric does not specify unit requirements.
Assertion strategy:

- Assert verdict is `REVIEW`
- Assert presence of a units-related flag
- Optionally assert a rubric specificity flag
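A crude but serviceable trigger for the units flag is to mark any numeric line that carries no recognized unit token. The whitelist below is illustrative only; a production validator would use a real quantity parser rather than a regex:

```python
import re

# Tiny assumed unit whitelist for the fixtures in this document.
UNIT_TOKEN = re.compile(r"\d\s*(g|mol|kN|kPa|MPa|m|s|J|K)\b")

def lines_missing_units(steps, final_answer):
    """Return lines that contain numbers but no recognized unit token."""
    flagged = []
    for line in steps + [final_answer]:
        if re.search(r"\d", line) and not UNIT_TOKEN.search(line):
            flagged.append(line)
    return flagged

# Fixture 5: every step and the final answer are dimensionless
steps = ["Moles of CH4 = 10 / 16 = 0.625",
         "Moles of CO2 = 0.625",
         "Mass of CO2 = 0.625 * 44 = 27.5"]
print(len(lines_missing_units(steps, "27.5")))  # 4
print(lines_missing_units(["Mass of CO2 = 0.625 * 44 = 27.5 g"], "27.5 g"))  # []
```

Because the verdict here is `REVIEW`, not `NO-GO`, the check only needs to surface the pattern; the judgment call about grading standards stays with the human reviewer.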
Fixture 6: chemistry-correct-answer-wrong-path.json
Domain: Chemistry
Topic: Limiting Reagent
Pattern: The final answer appears to be stated with confidence, and the numerical intermediate steps are correct, but the reasoning contains an error. This is the hardest case for automated validators because surface-level checks (format, schema, numerical consistency) all pass.
{
"domain": "chemistry",
"topic": "limiting reagent",
"problem_statement": "10 g of H₂ reacts with 80 g of O₂ to form water. Which is the limiting reagent? (2H₂ + O₂ → 2H₂O)",
"solution": {
"steps": [
"Moles of H₂ = 10 / 2 = 5 mol",
"Moles of O₂ = 80 / 32 = 2.5 mol",
"Ratio needed: 2:1, so need 5 mol H₂ per 2.5 mol O₂",
"O₂ is limiting because we have less of it"
],
"final_answer": "O₂ is the limiting reagent"
},
"rubric": [
{ "criterion": "Correct mole conversions", "points": 2 },
{ "criterion": "Correct identification of limiting reagent", "points": 3 }
]
}
What is broken: The mole calculations are correct (5 mol H₂, 2.5 mol O₂). The stoichiometric ratio is correctly stated as 2:1. But step 4's reasoning — "O₂ is limiting because we have less of it" — is wrong. Having fewer moles does not make a reagent limiting; the comparison must account for stoichiometric requirements. In this case, 5 mol H₂ requires 2.5 mol O₂ (from the 2:1 ratio), and we have exactly 2.5 mol O₂, so the reagents are stoichiometrically matched — neither is truly limiting. The final answer is therefore either wrong or at best misleading.
Why this matters: This pattern — correct numbers, incorrect reasoning, plausible-looking conclusion — is the single most dangerous failure mode for educational content. A student who reads this solution will internalize the wrong heuristic ("whichever you have less of is limiting"). A validator that only checks numerical consistency or answer format will miss this entirely. This fixture tests whether the validator can detect logical inconsistency between stated reasoning and the mathematical setup.
Expected output: Verdict NO-GO. Key flags should include reasoning inconsistency and incorrect or unsupported final answer.
Assertion strategy:

- Assert verdict is `NO-GO`
- Assert presence of a reasoning or logic flag
- Assert at least 2 FAIL-level issues
Fixture 7: biology-overly-vague-rubric.json
Domain: Biology
Topic: Cellular Respiration
Pattern: Schema-valid, no missing fields, correct content, but the rubric is too vague to be actionable for grading. This is a subjective quality failure, not an objective error.
{
"domain": "biology",
"topic": "cellular respiration",
"problem_statement": "Explain how cells produce ATP during aerobic respiration. Include the three main stages.",
"solution": {
"steps": [
"Glycolysis occurs in the cytoplasm, producing 2 ATP and 2 pyruvate",
"Pyruvate oxidation and Krebs cycle in mitochondrial matrix produce NADH, FADH₂, and 2 ATP",
"Electron transport chain on inner mitochondrial membrane produces ~32 ATP via oxidative phosphorylation"
],
"final_answer": "Aerobic respiration produces approximately 36-38 ATP total across three stages: glycolysis, the Krebs cycle, and the electron transport chain."
},
"rubric": [
{ "criterion": "Mentions all three stages", "points": 3 },
{ "criterion": "Shows understanding of the process", "points": 4 },
{ "criterion": "Answer is complete", "points": 3 }
]
}
What is broken: The solution is solid. The problem statement is clear. But the rubric is nearly useless for grading. Criterion 2 ("shows understanding of the process") provides no guidance on what constitutes understanding versus not — a grader would have to invent their own standard. Criterion 3 ("answer is complete") is circular — complete relative to what? There is no partial credit guidance, no indication of what a 2/4 versus a 3/4 looks like for criterion 2.
Why this matters: This fixture is the most important one for calibrating the boundary between REVIEW and NO-GO. The content is not wrong. A grader could, with effort, use this rubric. But it will produce inconsistent grading across graders, which defeats the purpose of having a rubric. The validator should flag this as a quality concern, but it should not block publication — the fix is rubric revision, which is an author judgment call, not a structural repair.
Expected output: Verdict REVIEW. Key flags should include rubric criteria lack specificity and no partial credit guidance.
Assertion strategy:

- Assert verdict is `REVIEW` (not `NO-GO` — this is a quality issue, not a correctness issue)
- Assert presence of a rubric quality or vagueness flag
- This fixture is specifically useful for testing that the validator does not over-penalize subjective issues
Fixture 8: engineering-contradictory-constraints.json
Domain: Engineering
Topic: Structural Beam Loading
Pattern: Problem statement contains mutually exclusive conditions. Common when problems are assembled from templates or when constraints are added incrementally without checking consistency.
{
"domain": "engineering",
"topic": "structural beam loading",
"problem_statement": "A simply supported beam of length 4 m carries a uniformly distributed load of 10 kN/m. The beam is fixed at both ends. Find the maximum bending moment.",
"solution": {
"steps": [
"For simply supported beam: M_max = wL²/8",
"M_max = 10 × 16 / 8 = 20 kN·m"
],
"final_answer": "20 kN·m"
},
"rubric": [
{ "criterion": "Correct formula for beam type", "points": 3 },
{ "criterion": "Correct calculation", "points": 2 }
]
}
What is broken: The problem statement says the beam is "simply supported" and also "fixed at both ends." These are mutually exclusive boundary conditions in structural analysis. A simply supported beam under UDL has M_max = wL²/8. A fixed-fixed beam has M_max = wL²/12 at the supports and wL²/24 at midspan. The solution silently assumes simply supported, ignoring the contradiction. The rubric awards points for "correct formula for beam type" — but the beam type is undefined because of the contradictory constraints.
Why this matters: Contradictory constraints are a NO-GO because there is no correct answer — any solution must ignore part of the problem statement, and the rubric cannot grade consistently against an ambiguous setup. This is distinct from the vague rubric case (Fixture 7) where the content is usable but imprecise. Here, the content is structurally broken.
Expected output: Verdict NO-GO. Key flags should include contradictory constraints in problem statement and indeterminate correct answer.
Assertion strategy:

- Assert verdict is `NO-GO`
- Assert presence of a constraint conflict or problem inconsistency flag
- Assert at least 1 FAIL-level issue
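Detecting this class of error can be bootstrapped with a table of mutually exclusive phrases. The pair list below is a hypothetical starting point, not an exhaustive ontology; the structural-analysis pair is the one Fixture 8 exercises:

```python
# Hypothetical incompatible-phrase table; a production checker would need
# a richer domain ontology than a flat keyword list.
INCOMPATIBLE = [
    ("simply supported", "fixed at both ends"),
    ("isothermal", "adiabatic"),
    ("frictionless", "coefficient of friction"),
]

def constraint_conflicts(problem_statement):
    """Return incompatible phrase pairs that co-occur in a problem statement."""
    text = problem_statement.lower()
    return [(a, b) for a, b in INCOMPATIBLE if a in text and b in text]

# Fixture 8's contradictory problem statement
stmt = ("A simply supported beam of length 4 m carries a uniformly distributed "
        "load of 10 kN/m. The beam is fixed at both ends. Find the maximum "
        "bending moment.")
print(constraint_conflicts(stmt))  # [('simply supported', 'fixed at both ends')]
```

Any hit in this table is grounds for the constraint-conflict flag, since the problem has no determinate correct answer once the boundary conditions contradict each other.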
Fixture 9: engineering-missing-metadata-edge-case.json
Domain: Engineering
Topic: Pressure Vessel Design (topic field is null)
Pattern: Content is technically sound, but required metadata fields are absent or null, and the solution makes unstated assumptions. Tests both schema validation and content completeness checks.
{
"domain": "engineering",
"topic": null,
"problem_statement": "Design a cylindrical pressure vessel to hold 500 kPa internal pressure with a safety factor of 2. Wall thickness should use thin-wall approximation. Inner radius = 0.5 m.",
"solution": {
"steps": [
"Design pressure = 500 × 2 = 1000 kPa",
"t = Pr / (2σ_allowable)",
"Assuming σ_allowable = 250 MPa: t = (1000 × 10³ × 0.5) / (2 × 250 × 10⁶) = 0.001 m = 1 mm"
],
"final_answer": "Minimum wall thickness = 1 mm"
},
"rubric": [
{ "criterion": "Correct application of safety factor", "points": 2 },
{ "criterion": "Correct thin-wall formula", "points": 2 },
{ "criterion": "Correct numerical result", "points": 1 }
]
}
What is broken: The topic field is null, which may violate schema requirements depending on the spec. More importantly, the solution assumes σ_allowable = 250 MPa without specifying the material. The problem statement does not mention material selection. The calculation is internally consistent given the assumption, but a student working this problem would need to know what material is assumed, and a grader could not evaluate "correct numerical result" without knowing the intended material. This also raises the question of thin-wall approximation validity: t/r = 1/500 = 0.002, which is well within the thin-wall regime, so that particular assumption is fine.
Expected output: Verdict REVIEW. Key flags should include null required field and unstated material assumption.
Assertion strategy:

- Assert verdict is `REVIEW`
- Assert presence of a schema or metadata flag
- Assert presence of an assumption or completeness flag
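The schema side of this fixture reduces to a null-or-missing check over required fields. A sketch, assuming the field list shown (the actual schema may name its fields differently):

```python
# Assumed required-field list; the real schema may differ.
REQUIRED_FIELDS = ["domain", "topic", "problem_statement", "solution", "rubric"]

def metadata_flags(submission):
    """Flag required top-level fields that are absent or explicitly null."""
    return [f"null_or_missing:{field}" for field in REQUIRED_FIELDS
            if submission.get(field) is None]

fixture9 = {
    "domain": "engineering",
    "topic": None,  # the broken field in Fixture 9
    "problem_statement": "Design a cylindrical pressure vessel.",
    "solution": {"steps": []},
    "rubric": [],
}
print(metadata_flags(fixture9))  # ['null_or_missing:topic']
```

The unstated-assumption half of the fixture (the missing material for σ_allowable) is the harder part and cannot be caught by a schema check; it requires content-level analysis of which symbols are used but never defined.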
Fixture 10: cross-domain-hybrid-failure.json
Domain: Physics (Thermodynamics)
Topic: Isothermal Expansion
Pattern: Multiple small issues across solution, rubric, and conventions that are individually minor but compound into an unreliable submission. This is the most realistic "production" fixture and the most important one for calibrating validator sensitivity.
{
"domain": "physics",
"topic": "thermodynamics",
"problem_statement": "An ideal gas undergoes isothermal expansion from V₁ = 1 L to V₂ = 3 L at T = 300 K. Calculate the work done by the gas. n = 0.5 mol.",
"solution": {
"steps": [
"For isothermal process: W = nRT ln(V₂/V₁)",
"W = 0.5 × 8.314 × 300 × ln(3)",
"W = 0.5 × 8.314 × 300 × 1.099",
"W ≈ 137 J"
],
"final_answer": "137 J"
},
"rubric": [
{ "criterion": "Correct identification of isothermal work formula", "points": 2 },
{ "criterion": "Correct substitution", "points": 2 },
{ "criterion": "Answer within 5% of 137 J", "points": 1 }
]
}
What is broken — and this requires careful analysis:
The formula W = nRT ln(V₂/V₁) is correct. The substitution is: 0.5 × 8.314 × 300 × ln(3).
Let us compute: 0.5 × 8.314 = 4.157. Then 4.157 × 300 = 1247.1. Then ln(3) ≈ 1.0986. Then 1247.1 × 1.0986 ≈ 1370 J.
The solution claims W ≈ 137 J. This is off by a factor of 10. The arithmetic in step 3 is wrong — 0.5 × 8.314 × 300 × 1.099 ≈ 1370, not 137. This is a decimal point error, likely from computing 0.5 × 8.314 × 300 as 124.71 instead of 1247.1.
The rubric compounds this error by hardcoding "within 5% of 137 J" as the acceptance criterion. This means a student who correctly computes the answer as approximately 1370 J would lose the point, while a student who makes the same decimal error would receive full credit. The rubric has encoded the wrong answer.
Additionally, the problem gives volumes in liters, not cubic meters. For the formula W = nRT ln(V₂/V₁), this does not affect the result because the ratio V₂/V₁ = 3 is dimensionless. However, the use of liters alongside R = 8.314 J/(mol·K) is an SI convention inconsistency that could confuse students who try to verify units.
Why this matters: This fixture tests three things simultaneously. First, can the validator detect an arithmetic error when the formula and substitution are correct but the multiplication is wrong? Second, can the validator detect that a rubric criterion encodes an incorrect reference answer? Third, can the validator flag the compounding nature of these issues — the arithmetic error alone might be a REVIEW, but the combination of wrong arithmetic plus a rubric that codifies the wrong answer makes this a NO-GO because the entire grading pipeline is built on an incorrect foundation.
Expected output: Verdict NO-GO. Key flags should include arithmetic error in solution, rubric encodes incorrect reference answer, and optionally unit convention inconsistency.
Assertion strategy:

- Assert verdict is `NO-GO`
- Assert presence of an arithmetic or calculation error flag
- Assert presence of a rubric correctness flag
- Assert at least 2 FAIL-level issues
- If the validator returns `REVIEW`, it is being too lenient — this is the calibration test
Summary Table
| # | Fixture | Domain | Failure Pattern | Expected Verdict | Key Assertion Flags |
|---|---------|--------|----------------|-----------------|-------------------|
| 1 | physics-golden-path | Physics | None (clean) | GO | No failures |
| 2 | physics-partial-credit-mismatch | Physics | Rubric-solution misalignment | REVIEW | rubric_alignment |
| 3 | physics-latex-formatting-edge-case | Physics | Malformed LaTeX | REVIEW | latex_parse_error |
| 4 | chemistry-golden-path | Chemistry | None (clean) | GO | No failures |
| 5 | chemistry-units-missing | Chemistry | Units absent throughout | REVIEW | units_missing |
| 6 | chemistry-correct-answer-wrong-path | Chemistry | Correct answer, broken reasoning | NO-GO | reasoning_inconsistency |
| 7 | biology-overly-vague-rubric | Biology | Unactionable rubric | REVIEW | rubric_vagueness |
| 8 | engineering-contradictory-constraints | Engineering | Mutually exclusive conditions | NO-GO | constraint_conflict |
| 9 | engineering-missing-metadata-edge-case | Engineering | Null fields, unstated assumptions | REVIEW | schema_deviation, unstated_assumption |
| 10 | cross-domain-hybrid-failure | Physics/Thermo | Compounding arithmetic + rubric errors | NO-GO | arithmetic_error, rubric_incorrect_answer |
Evaluation Framework
Assertion Strategy: Fuzzy, Not Exact
For each fixture, automated test assertions should follow this pattern:
    def test_fixture(validator_output, expected):
        # Level 1: Verdict must match exactly
        assert validator_output.verdict == expected.verdict

        # Level 2: Key flags must be present (subset check, not exact match)
        flag_names = {f.name for f in validator_output.flags}
        for flag in expected.required_flags:
            assert flag in flag_names

        # Level 3: For NO-GO verdicts, require a minimum failure count
        if expected.verdict == "NO-GO":
            fail_count = sum(1 for f in validator_output.flags
                             if f.severity == "FAIL")
            assert fail_count >= expected.min_fail_count
This approach avoids brittleness. The validator may identify additional issues beyond the key flags — that is acceptable and even desirable. What matters is that it catches the critical issues and assigns the correct verdict tier.
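As a concrete illustration, the three assertion levels can be exercised end to end with mocked objects. `SimpleNamespace` stands in for whatever result types the validator actually returns; the flag names mirror the Fixture 10 expectations but the shapes here are assumptions:

```python
from types import SimpleNamespace as NS

# Hypothetical validator output for Fixture 10, for illustration only.
output = NS(verdict="NO-GO", flags=[
    NS(name="arithmetic_error", severity="FAIL"),
    NS(name="rubric_incorrect_answer", severity="FAIL"),
    NS(name="unit_convention_inconsistency", severity="WARN"),
])
expected = NS(verdict="NO-GO",
              required_flags=["arithmetic_error", "rubric_incorrect_answer"],
              min_fail_count=2)

# Level 1: exact verdict; Level 2: subset of key flags; Level 3: minimum FAIL count.
assert output.verdict == expected.verdict
assert {f.name for f in output.flags} >= set(expected.required_flags)
assert sum(1 for f in output.flags if f.severity == "FAIL") >= expected.min_fail_count
print("all three assertion levels pass")
```

Note that the extra WARN-level flag does not break the test; additional findings beyond the required subset are tolerated by design.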
Precision and Recall: Tiered by Error Type
Mixing objective errors and subjective quality issues in a single precision/recall calculation produces noisy, misleading metrics. Instead, track them separately:
Tier 1 — Objective Errors:
- Schema violations (null required fields, type mismatches)
- Mathematical inconsistencies (arithmetic errors, formula misapplication)
- Logical contradictions (mutually exclusive constraints, reasoning that contradicts its own setup)
- Formatting blockers (malformed LaTeX that prevents rendering)

Tier 2 — Subjective Quality Issues:

- Rubric vagueness or lack of specificity
- Missing units where convention expects them but rubric does not require them
- Unstated assumptions that are reasonable but not explicit
- Partial credit guidance gaps
Precision and recall for Tier 1 should be high (target: precision ≥ 0.9, recall ≥ 0.85). These are errors with clear ground truth. Precision and recall for Tier 2 will inherently be lower because human annotators disagree on these issues. Target precision ≥ 0.7, recall ≥ 0.6, and monitor inter-annotator agreement as a ceiling on achievable performance.
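Keeping the tiers separate is mechanical once each finding carries a (fixture, flag) label. A sketch with hypothetical tier-1 labels drawn from the fixtures above:

```python
def precision_recall(predicted, actual):
    """Precision/recall over sets of (fixture_id, flag) pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(actual) if actual else 1.0
    return precision, recall

# Hypothetical tier-1 ground truth: scoring objective errors separately keeps
# annotator disagreement on subjective issues out of these numbers.
tier1_truth = {(6, "reasoning_inconsistency"),
               (8, "constraint_conflict"),
               (10, "arithmetic_error")}
tier1_pred = {(6, "reasoning_inconsistency"),
              (10, "arithmetic_error")}

p, r = precision_recall(tier1_pred, tier1_truth)
print(p, round(r, 2))  # 1.0 0.67
```

Running the same function over a separately labeled tier-2 set gives the second pair of numbers, to be read against the lower tier-2 targets and the inter-annotator ceiling.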
The Missing Metric: Author Friction
Beyond precision and recall, track one metric that captures real-world validator effectiveness: author override rate. This is the frequency with which authors see a validator flag and either dismiss it without action or override it to force publication.
A validator with high precision and recall but a high override rate is failing. It means the flags are technically correct but practically unhelpful — either too noisy, too vaguely described, or flagging issues authors consider unimportant. This metric is the single best indicator of whether the validator is building trust or eroding it.
Track override rate separately for REVIEW and NO-GO verdicts. A high override rate on REVIEW may be acceptable (authors are making informed judgment calls). A high override rate on NO-GO is a serious problem — either the validator is miscalibrated or the escalation path is broken.
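A sketch of the bookkeeping, assuming production logs can be reduced to (verdict, was_overridden) pairs; the log shape is hypothetical, the point is only that the two verdict tiers are tracked separately:

```python
def override_rates(events):
    """Per-verdict override rate from (verdict, was_overridden) event pairs."""
    rates = {}
    for verdict in ("REVIEW", "NO-GO"):
        outcomes = [overridden for v, overridden in events if v == verdict]
        rates[verdict] = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return rates

# Toy log: two REVIEW events (one overridden), two NO-GO events (none overridden)
log = [("REVIEW", True), ("REVIEW", False), ("NO-GO", False), ("NO-GO", False)]
print(override_rates(log))  # {'REVIEW': 0.5, 'NO-GO': 0.0}
```

In this toy log the NO-GO rate is the healthy one; a sustained nonzero NO-GO override rate is the signal that warrants investigation first.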
Implementation Recommendations
Phase 1: Bootstrap with These 10 Fixtures
Run all 10 fixtures through the validator. Check verdicts and key flags against expectations. This gives immediate signal on calibration without requiring a large test corpus.
Pay special attention to three diagnostic cases:
- Fixtures 1 and 4 (golden paths): If either produces a `REVIEW` or `NO-GO`, the validator is over-sensitive. Fix this first, because false positives destroy author trust faster than false negatives miss real issues.
- Fixture 10 (hybrid failure): If this produces `REVIEW` instead of `NO-GO`, the validator is under-sensitive to compounding errors. If it produces `NO-GO` but only flags one issue, it is missing the interaction between the arithmetic error and the rubric encoding the wrong answer.
- Fixture 7 (vague rubric): If this produces `NO-GO`, the validator is conflating quality concerns with correctness errors. This distinction is critical for author experience.
Phase 2: Manual Comparison on Real Submissions
Before building formal precision/recall infrastructure, run the validator on 10 real submissions from production and manually compare outputs. This is the fastest way to identify systematic blind spots. Look for:
- Issues the validator flags that a human would not care about (precision problem)
- Issues a human identifies immediately that the validator misses (recall problem)
- Cases where the verdict is correct but the flag descriptions are confusing or unhelpful (usability problem)
Phase 3: Expand Fixture Set Based on Findings
After Phase 2, add fixtures that encode the specific failure patterns you observed in real data. Prioritize patterns that:
- Appear frequently (high base rate in production)
- Have high impact when missed (lead to student complaints or grading inconsistency)
- Are close to the decision boundary between verdict tiers (these are where calibration matters most)
Suggested expansion candidates based on common production patterns:
- Numerically correct but methodologically wrong (physics): Student uses a shortcut or wrong method that happens to give the right answer for specific parameter values but would fail for general cases.
- Multiple valid answers but rubric assumes one (biology): Open-ended question where the rubric's expected answer is one of several correct responses. Tests whether the validator flags rubric rigidity.
- Subtle unit conversion error (chemistry): A factor-of-1000 error buried in a multi-step calculation where the final answer is in a plausible range. Tests whether the validator can trace arithmetic through unit conversions.
- Copy-paste artifact (any domain): Problem statement or solution contains a fragment from a different problem that was not fully cleaned up. Tests whether the validator detects content coherence issues.
Design Rationale: Why This Structure
The fixture set is intentionally asymmetric: 2 golden paths, 5 REVIEW cases, 3 NO-GO cases. This reflects the actual distribution of production submissions, where most issues are gray-area quality concerns rather than clear-cut failures. A fixture set with equal numbers of GO/REVIEW/NO-GO would misrepresent the real problem. It would cause the validator to be trained or calibrated against an artificial distribution, leading to threshold choices that produce too many NO-GO verdicts in production and erode the author relationship before the system earns any credibility.
The asymmetry also serves a secondary purpose: it forces the validator to earn its NO-GO verdicts. When only 3 of 10 fixtures are NO-GO, and each NO-GO requires demonstrable multi-factor evidence, the validator cannot achieve a good score by being aggressively strict. An over-eager validator that flags everything as NO-GO will pass all three NO-GO cases but fail both golden paths — and that failure is immediately visible in the test results.
Appendix A: Fixture Edge Cases and Disambiguation Notes
On Fixture 6: Why the Limiting Reagent Answer Is Wrong
This requires careful unpacking because the stoichiometric analysis is genuinely subtle.
The reaction is 2H₂ + O₂ → 2H₂O.
We have 5 mol H₂ and 2.5 mol O₂.
To consume all 5 mol H₂, we need 5/2 = 2.5 mol O₂. We have exactly 2.5 mol O₂.
To consume all 2.5 mol O₂, we need 2.5 × 2 = 5 mol H₂. We have exactly 5 mol H₂.
Both reagents are consumed simultaneously. This is a stoichiometrically perfect ratio — there is no limiting reagent in the technical sense. The reaction goes to completion with nothing left over.
The solution in Fixture 6 concludes that O₂ is limiting "because we have less of it." This heuristic — the reagent with fewer moles is the limiting reagent — is a common student misconception that this fixture is designed to catch. It works by accident in many textbook problems (because textbook problems often give the limiting reagent fewer moles by design), but it is not a valid general principle. The valid principle is: for each reagent, compute how much of the other reagent would be required to consume it completely; the reagent that would run out first is limiting.
In this specific case, the heuristic produces an answer that is wrong in a particularly insidious way: the answer "O₂ is limiting" is defensible under a misunderstanding, and a student who memorized the wrong rule would agree with it. A validator that checks only whether the answer is in the set of plausible-sounding answers would pass this. The validator needs to detect the explicit statement of the wrong reasoning principle in step 4.
This also means the fixture cannot be validated by answer-matching alone. The answer "O₂ is limiting" is not trivially wrong — it would be correct in a different problem where the ratio was not exactly 2:1. The wrongness is entirely in the reasoning, not in the form of the answer. This is why it is classified as a reasoning_inconsistency flag rather than an incorrect_answer flag.
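The valid principle stated above (compare how far each reagent alone could carry the reaction) can be sketched in a few lines. This is an illustrative implementation, not part of the fixture spec; the function name and dict-based signature are assumptions for the example.

```python
def limiting_reagent(moles, coeffs):
    """Return the reagent that runs out first, or None for a perfect ratio.

    moles, coeffs: dicts keyed by species, e.g. {"H2": 5.0, "O2": 2.5}
    and {"H2": 2, "O2": 1} for 2 H2 + O2 -> 2 H2O.
    """
    # Extent of reaction each reagent alone could support: n_i / nu_i.
    extents = {sp: moles[sp] / coeffs[sp] for sp in coeffs}
    lo, hi = min(extents.values()), max(extents.values())
    if abs(hi - lo) < 1e-9 * max(hi, 1.0):
        return None  # stoichiometrically perfect ratio: no limiting reagent
    return min(extents, key=extents.get)

print(limiting_reagent({"H2": 5.0, "O2": 2.5}, {"H2": 2, "O2": 1}))  # None
print(limiting_reagent({"H2": 4.0, "O2": 2.5}, {"H2": 2, "O2": 1}))  # H2
```

Note that the Fixture 6 quantities return `None`: under the correct principle there is no limiting reagent, which is exactly the case the "fewer moles" heuristic gets wrong.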
On Fixture 10: The Decimal Error Trace
For implementers who need to verify the arithmetic independently:
W = nRT ln(V₂/V₁)
W = (0.5 mol)(8.314 J/mol·K)(300 K)(ln 3)
W = (0.5)(8.314)(300)(1.0986)
W = (4.157)(300)(1.0986)
W = (1247.1)(1.0986)
W ≈ 1370 J
The submitted solution claims W ≈ 137 J. The error is a factor of 10. The most likely source is a mis-keyed intermediate: computing 0.5 × 8.314 × 300 as 124.71 instead of 1247.1. This is consistent with accidentally dividing by 10 at the multiplication step, perhaps through misplaced decimal entry on a calculator.
The factor-of-10 error is important for validator implementation. A validator that checks "is the answer in a reasonable range for this type of problem" might pass 137 J because it is a plausible energy quantity. The validator needs to actually recompute the expression or at minimum check internal consistency — does step 4 follow numerically from step 3?
Checking: step 3 states 0.5 × 8.314 × 300 × 1.099. The result of this multiplication is ≈ 1370, not 137. The inconsistency between the expression and the claimed result is detectable without knowing the correct answer in advance. This is the recommended implementation path: flag the arithmetic inconsistency between step 3 and step 4, rather than relying on a reference answer lookup.
The rubric issue is secondary but cascading. Once the validator flags the arithmetic error, it should propagate that finding to the rubric check: if the solution arithmetic is wrong, any rubric criterion that encodes a numerical tolerance around the solution's answer is itself wrong. This is the cascading failure logic that elevates Fixture 10 from REVIEW to NO-GO. A single arithmetic error in isolation might warrant REVIEW. An arithmetic error that has infected the rubric's acceptance criterion means the entire grading framework is built on an incorrect number — that is a structural failure requiring NO-GO.
Appendix B: Verdict Tier Definitions for Implementers
These definitions are operational, not philosophical. They are designed to produce consistent classification behavior across the fixture set and in production.
GO
All the following are true:
- No Tier 1 (objective) errors present
- Rubric maps to solution with no unresolvable gaps
- No required fields absent
- No formatting issues that would block rendering
- Units present and consistent (or unit-free in a context where dimensionless quantities are appropriate)
A GO verdict does not mean the submission is optimal. It means it is ready for publication without author intervention. Minor informational notes are compatible with GO — they are surfaced to help authors improve quality but do not require action.
REVIEW
At least one of the following is true, but no NO-GO conditions are met:
- Tier 2 (subjective quality) issues are present
- Rubric has gaps in partial credit guidance
- Units are absent but not contradicted by rubric requirements
- Required metadata fields are null or missing
- Unstated assumptions that are reasonable but not documented
- LaTeX formatting issues that are flagged but where the content remains interpretable (e.g., rendered formula is ambiguous but not broken beyond recovery)
A REVIEW verdict requires author attention before publication, but the validator cannot determine whether the submission should be published. That decision belongs to the author. The validator's role is to surface the specific issues and let the author make an informed call.
NO-GO
At least one of the following is true:
- Tier 1 error that produces an incorrect or indeterminate answer
- Mutually exclusive constraints with no resolution path
- Reasoning that explicitly contradicts the mathematical setup and leads to an incorrect conclusion
- Rubric that encodes an incorrect reference answer
- Malformed LaTeX that prevents rendering and cannot be auto-corrected
- Multiple Tier 2 issues that compound into a Tier 1 failure (this is the compound-failure rule: two REVIEW-level issues in the same critical path can combine to a NO-GO)
A NO-GO verdict blocks publication. The author must revise and resubmit. The validator output for NO-GO should always include specific, actionable remediation guidance — not just what is wrong, but what the corrected version should look like. A NO-GO without remediation guidance is a dead end for the author.
The compound-failure rule is the hardest part to implement consistently. It requires the validator to reason about whether two issues are in the same critical path. For Fixture 10, the arithmetic error and the rubric encoding of the wrong answer are in the same critical path: the rubric's reference answer was derived from the arithmetic, so the rubric error is a consequence of the arithmetic error, and together they make the grading framework unreliable. For contrast, in Fixture 9, the null topic field and the unstated material assumption are in different paths — one is metadata, one is content — so they do not compound and the verdict remains REVIEW.
Appendix C: Integration with Upstream Authoring Workflows
Where Validation Should Live
The validator is most valuable when it runs at two points in the authoring pipeline, not one.
Point 1: Interactive feedback during authoring. The validator runs incrementally as the author writes, flagging issues in near-real-time. At this point it should surface only Tier 1 issues. The reason: authors in flow state find Tier 2 feedback distracting. They know the rubric is vague — they are planning to fill it in later. Surfacing it during composition trains them to ignore the validator. Save the full validation report for submission time.
Point 2: Pre-submission gate. When the author submits, the full validator runs and produces the complete report with verdict, flags, and remediation guidance. This is where Tier 2 issues should appear. The author is in review mode, not composition mode, and is mentally prepared to evaluate feedback.
Running the validator only at pre-submission misses the opportunity for early intervention on Tier 1 issues. Running the full validator continuously during authoring creates noise that trains authors to ignore it. The two-point strategy captures the benefits of both and avoids the costs of each.
The Remediation Feedback Contract
Every flag in the validator output should conform to a five-field structure:
- flag_id: unique identifier for this flag type
- severity: INFO | WARN | FAIL
- location: where in the submission the issue occurs
- description: what is wrong, stated precisely
- remediation: what the author should do to fix it
The remediation field is non-optional for WARN and FAIL severity flags and optional for INFO. Implementers who skip the remediation field to save development time will discover that authors cannot act on vague flags, and that the override rate climbs immediately. The remediation field is the single most important driver of author compliance behavior.
For the fixtures in this document, example remediation messages would be:
Fixture 2, rubric alignment flag:
"The rubric criterion 'Correct use of energy conservation' does not match the solution, which uses kinematic equations. Either revise the rubric to describe the kinematic approach, or revise the solution to use energy conservation. If both approaches are acceptable, add a note to the problem statement indicating that either method may be used."
Fixture 8, constraint conflict flag:
"The problem statement describes the beam as both 'simply supported' and 'fixed at both ends.' These are mutually exclusive boundary conditions. For a simply supported beam, the correct maximum moment formula is wL²/8 = 20 kN·m. For a fixed-fixed beam, the formula is wL²/12 = 13.3 kN·m at supports and wL²/24 = 6.67 kN·m at midspan. Determine the intended boundary condition and revise the problem statement and solution accordingly."
Fixture 10, arithmetic flag:
"The calculation in step 4 appears to contain a decimal error. The expression 0.5 × 8.314 × 300 × 1.099 evaluates to approximately 1370 J, not 137 J. Please verify the arithmetic and update the final answer. Note that the rubric criterion 'Answer within 5% of 137 J' will also need to be revised once the correct answer is confirmed."
These are not brief error codes. They are actionable instructions. The difference in author experience between a terse error code and a specific remediation message is the difference between a validator that authors tolerate and one they trust.
Feedback Loop: Closing the Loop on Override Decisions
When authors override a REVIEW verdict to force publication, the system should capture:
- Which flags were present at override time
- Whether the author provided a justification
- What happened downstream (did the content receive complaints, grading inconsistency reports, or rendering failures?)
This data is the training signal for improving the validator's calibration over time. Without it, the validator is operating open-loop — it produces output but never learns whether that output was correct or useful. With it, patterns emerge: specific flag types that are consistently overridden by experienced authors (suggesting the flag threshold is too sensitive) or specific flag types that are overridden and then generate downstream problems (suggesting authors are overconfident about dismissing them).
The override capture mechanism does not need to be elaborate. A simple log entry with flag IDs and a free-text justification field is sufficient for Phase 1. The analysis can happen manually on a monthly cadence. Automation of the feedback loop analysis is a Phase 3 concern, after the basic patterns have been established.
Appendix D: Known Limitations of the Fixture Set
Coverage gaps intentional at this stage:
The fixture set does not cover image-based content (diagrams, graphs, figures). This is a deliberate exclusion. Validating image content requires computer vision capabilities that are orthogonal to the text-based validation logic described here. Adding image fixtures at this stage would create scope confusion without adding proportionate value. Image validation should be a separate module with its own fixture set.
The fixture set does not cover multi-part problems where each part has its own rubric. These are common in STEM education but add significant structural complexity to the validation logic. The validation rules described here apply to single-part problems. Multi-part problem support requires extending the validator to handle part-level verdict aggregation and to check cross-part consistency (where a result in Part A is used as a given in Part B).
The fixture set does not cover language quality issues — clarity of prose in the problem statement, technical terminology accuracy, or reading level appropriateness. These are real quality concerns but they require natural language processing capabilities distinct from the structural and mathematical validation described here. Including them in the same fixture set would obscure the separation of concerns.
Coverage gaps that should be addressed in Phase 2:
Fixtures involving SI unit conversion errors across multi-step calculations (mentioned in the Phase 3 expansion candidates, but actually common enough to merit Phase 2 attention).
Fixtures involving conditional problems — "if X, find Y; if not X, find Z" — where the solution only addresses one branch and the rubric implicitly assumes the other.
Fixtures involving implicit reference data — problems that require looking up a value (e.g., a specific heat capacity, a bond enthalpy, a material modulus) that is standard in the domain but not provided in the problem statement. These are borderline acceptable in some authoring contexts and not in others, making them ideal for testing verdict calibration.
Fundamental limitation:
The fixture set can test whether the validator detects the specific failure patterns encoded in the fixtures. It cannot guarantee that the validator will detect novel failure patterns not represented in the fixture set. This is the inherent limitation of fixture-based validation. The remedy is not more fixtures (though more is better than fewer) but continuous monitoring of production outputs against human review, as described in Phase 2 of the implementation plan. The fixtures are a starting point and a regression test, not a complete specification of validator behavior.