# Proctor Task Quick Reference
Pick your task type → use the matching template → adapt the rubric pattern.
## Which template?
| Task type | Template | Prompt style | Golden solution format |
|-----------|----------|-------------|----------------------|
| ML / data science | template_coding_task.md | Informal + data generator | Runnable Python |
| Software engineering | template_coding_task.md | Bug report or feature spec | Runnable code + tests |
| Finance / regulatory | template_analytical_task.md | Formal memo + data tables | Narrative + verification script |
| Health / epidemiology | template_analytical_task.md | Clinical brief + cohort data | Narrative + computation |
| Engineering / physics | Either | Depends on whether code is the deliverable | Simulation or analysis |
## Rubric patterns by task type

### ML / Data Science (coding)

**Typical 5 areas:**
- Data handling — leakage prevention, correct labels, proper splits
- Feature engineering — backward-looking, grouped, reasonable
- Model training — imbalance handling, appropriate method
- Evaluation protocol — validation vs test, right metrics, cost objective
- Completeness — runnable code, explained decisions
**Common critical fails:**
- Random split on temporal data
- Threshold/hyperparameter tuned on test
- Accuracy as primary metric on imbalanced data
- Future information in features
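The split-related fails above can be sketched in a few lines. This is a minimal illustration assuming a generic pandas DataFrame with an `event_time` column, not any specific task's data:

```python
import numpy as np
import pandas as pd

# Hypothetical temporal dataset: 100 daily observations.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "event_time": pd.date_range("2023-01-01", periods=100, freq="D"),
    "feature": rng.normal(size=100),
    "label": rng.integers(0, 2, size=100),
})

# Temporal split: train strictly before the cutoff, test at or after it.
# A random split here would leak future rows into training.
cutoff = df["event_time"].quantile(0.8)
train = df[df["event_time"] < cutoff]
test = df[df["event_time"] >= cutoff]

# No training row may postdate the earliest test row.
assert train["event_time"].max() < test["event_time"].min()
```

The final assertion is the quickest check that no future information crossed the split.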
### Finance / Regulatory (analytical)

**Typical 5 areas:**
- Data quality — catches planted errors in tables
- Methodology — applies required standard, not shortcuts
- Scope discipline — excludes red herrings, stays within approved model
- Numerical accuracy — headline result within tolerance
- Disclosure — acknowledges assumptions, limitations, sensitivity
**Common critical fails:**
- Misses data quality error that cascades through computation
- Uses excluded variable or unapproved method
- Narrative contradicts own numbers
- No sensitivity analysis on key assumptions
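A verification script for this kind of memo might start with mechanical checks like the ones below; the transition matrix values are invented for illustration:

```python
import numpy as np

# Hypothetical 3-state rating transition matrix copied from the memo's table.
# A planted data-quality error would surface as a row not summing to 1.
transition = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.85, 0.10],
    [0.00, 0.00, 1.00],  # absorbing default state
])

row_sums = transition.sum(axis=1)
bad_rows = np.where(~np.isclose(row_sums, 1.0, atol=1e-6))[0]
assert bad_rows.size == 0, f"rows {bad_rows.tolist()} do not sum to 1"

# Headline check: 3-year cumulative default probability from state 0,
# recomputed independently and compared to the memo's number within tolerance.
pd_3yr = np.linalg.matrix_power(transition, 3)[0, 2]
```

Catching a bad row here matters because any error in the base matrix compounds through the matrix power into the headline figure.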
### Software Engineering (coding)

**Typical 5 areas:**
- Correctness — handles edge cases, passes test suite
- Architecture — appropriate patterns, separation of concerns
- Error handling — graceful failures, input validation
- Performance — no obvious O(n²) where O(n) is possible
- Readability — clear naming, documented decisions
**Common critical fails:**
- Silently wrong on edge cases
- Security vulnerability in obvious place
- Code doesn't compile/run
- Ignores stated constraints
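A hypothetical example of the "silently wrong on edge cases" failure: a small parser that validates its input and fails loudly instead of guessing. The function name and behavior are invented for illustration:

```python
def parse_percentage(text: str) -> float:
    """Parse '85%' or '0.85' into a fraction in [0, 1], failing loudly on bad input."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError(f"expected a non-empty string, got {text!r}")
    s = text.strip()
    value = float(s.rstrip("%"))  # raises ValueError on non-numeric input
    if s.endswith("%"):
        value /= 100.0
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"out of range: {text!r}")
    return value

# Edge cases a "silently wrong" implementation often mishandles:
assert parse_percentage("85%") == 0.85
assert parse_percentage(" 0.85 ") == 0.85
```

A graceful failure (a clear `ValueError`) is scoreable under the error-handling area; returning a plausible-looking wrong number is the critical fail.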
### Health / Epidemiology (analytical)

**Typical 5 areas:**
- Study design — correct comparison groups, confounding addressed
- Statistical method — appropriate test, assumptions checked
- Data issues — missing data handling, population definition
- Interpretation — causal language matches study design
- Clinical relevance — effect size context, not just p-values
**Common critical fails:**
- Causal claims from observational data
- Wrong statistical test for data structure
- Ignores confounders present in the data
- Misinterprets clinical significance
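The confounding point above can be checked numerically with direct age standardization. All weights, events, and person-years below are invented for illustration:

```python
# Direct age standardization: weight stratum-specific rates by a shared
# standard population so cohorts with different age mixes are comparable.
standard_weights = {"young": 0.6, "old": 0.4}
cohort_a = {"young": (5, 1000), "old": (40, 500)}   # stratum -> (events, person-years)
cohort_b = {"young": (8, 2000), "old": (40, 400)}

def age_adjusted_rate(cohort: dict) -> float:
    # Sum of (standard weight * stratum rate) over strata.
    return sum(w * cohort[s][0] / cohort[s][1] for s, w in standard_weights.items())

rate_a = age_adjusted_rate(cohort_a)  # ~0.035 events per person-year
rate_b = age_adjusted_rate(cohort_b)  # ~0.042
```

A response that compares crude rates across cohorts with different age structures is making the "ignores confounders" fail in miniature.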
## Trap selection by domain

### Data integrity traps (good for any domain)
- Row/column sum error → Finance (matrices), ML (confusion matrices), Health (population tables)
- Narrative contradicts data → Any memo-format prompt
- Legacy category label → Finance (ratings), Health (ICD codes)
- Red herring variable → Any task with "approved methodology"
- Unit mismatch → Finance (M vs K), Health (mg vs g), Engineering (metric vs imperial)
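The row/column sum trap can be planted and detected mechanically. A sketch using an invented two-region table:

```python
# Hypothetical population table with a planted row-sum error in region_b.
table = {
    "region_a": {"cases": 120, "controls": 480, "total": 600},
    "region_b": {"cases": 95, "controls": 400, "total": 500},  # 95 + 400 = 495
}

errors = [name for name, row in table.items()
          if row["cases"] + row["controls"] != row["total"]]
print(errors)  # ['region_b']
```

A strong response flags the inconsistency before computing anything downstream from the table.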
### Reasoning traps (good for computation-heavy tasks)
- Must transform before comparing → Finance (annualize PDs), Health (age-adjust rates), ML (normalize metrics)
- Wrong default threshold → ML (0.5 classification), Finance (materiality), Health (screening cutoff)
- Temporal ordering matters → ML (time series), Finance (vintage), Health (longitudinal)
- Compounding invisible error → Finance (matrix power), Engineering (tolerance stacking)
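The "must transform before comparing" trap, in its finance form, can be illustrated with invented numbers:

```python
# A 5-year cumulative PD cannot be compared to a 1-year benchmark by
# simple division; the annual rate must be survival-consistent.
cum_pd_5y = 0.15

annual_naive = cum_pd_5y / 5                  # ~0.0300, understates the annual rate
annual_geom = 1 - (1 - cum_pd_5y) ** (1 / 5)  # ~0.0320, geometric annualization

# Check: compounding the geometric rate reproduces the cumulative PD.
assert abs(1 - (1 - annual_geom) ** 5 - cum_pd_5y) < 1e-12
```

The same pattern applies to age-adjusting rates in health tasks or normalizing metrics in ML: the raw numbers look comparable but are not.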
### Scope traps (good for constrained tasks)
- Tempting extension excluded by instructions → Any regulated domain
- Ambiguous parameter needing assumption + disclosure → Finance (EIR), ML (missing hyperparam)
## Notebook workflow
1. Open `mercor_scorer.ipynb`.
2. Cell 1: Edit `BENCHMARK` and `RUBRIC` for your task.
3. Cell 2: Paste Gemini's response.
4. Cell 3: Run → see automated keyword scores.
5. Cell 4: Fill manual overrides for items needing judgment.
6. Cell 5: Fill hint results (after sending hints).
7. Cell 6: Fill observations → get paste-ready Field 5.
8. Cell 7: Export the JSON record.
Total time after Gemini testing: ~10 minutes to score and generate all form fields.
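The automated keyword pass in Cell 3 can be sketched as follows; the rubric schema and keywords here are hypothetical, not the notebook's actual internals:

```python
# Hypothetical rubric: each area maps to keywords whose presence in the
# response counts toward that area's automated score.
RUBRIC = {
    "data_handling": ["temporal split", "leakage"],
    "evaluation": ["validation set", "cost objective"],
}

def keyword_scores(response: str, rubric: dict) -> dict:
    # Fraction of each area's keywords found in the lowercased response.
    text = response.lower()
    return {area: sum(kw in text for kw in kws) / len(kws)
            for area, kws in rubric.items()}

scores = keyword_scores("We used a temporal split to avoid leakage.", RUBRIC)
print(scores)  # {'data_handling': 1.0, 'evaluation': 0.0}
```

Keyword hits are only a first pass, which is why Cell 4's manual overrides exist for items needing judgment.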