# Proctor Task Quick Reference
Pick your task type → use the matching template → adapt the rubric pattern.
## Which template?
| Task type | Template | Prompt style | Golden solution format |
|-----------|----------|-------------|----------------------|
| ML / data science | template_coding_task.md | Informal + data generator | Runnable Python |
| Software engineering | template_coding_task.md | Bug report or feature spec | Runnable code + tests |
| Finance / regulatory | template_analytical_task.md | Formal memo + data tables | Narrative + verification script |
| Health / epidemiology | template_analytical_task.md | Clinical brief + cohort data | Narrative + computation |
| Engineering / physics | Either | Depends on whether code is the deliverable | Simulation or analysis |
## Rubric patterns by task type

### ML / Data Science (coding)

**Typical 5 areas:**
- Data handling — leakage prevention, correct labels, proper splits
- Feature engineering — backward-looking, grouped, reasonable
- Model training — imbalance handling, appropriate method
- Evaluation protocol — validation vs test, right metrics, cost objective
- Completeness — runnable code, explained decisions
**Common critical fails:**
- Random split on temporal data
- Threshold/hyperparameter tuned on test
- Accuracy as primary metric on imbalanced data
- Future information in features
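The split-related fails above can be sketched in a few lines. This is a minimal illustration assuming a generic pandas DataFrame with an `event_time` column, not any specific task's data:

```python
import numpy as np
import pandas as pd

# Hypothetical temporal dataset: 100 daily observations.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "event_time": pd.date_range("2023-01-01", periods=100, freq="D"),
    "feature": rng.normal(size=100),
    "label": rng.integers(0, 2, size=100),
})

# Temporal split: train strictly before the cutoff, test at or after it.
# A random split here would leak future rows into training.
cutoff = df["event_time"].quantile(0.8)
train = df[df["event_time"] < cutoff]
test = df[df["event_time"] >= cutoff]

# No training row may postdate the earliest test row.
assert train["event_time"].max() < test["event_time"].min()
```

The final assertion is the quickest check that no future information crossed the split.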
### Finance / Regulatory (analytical)

**Typical 5 areas:**
- Data quality — catches planted errors in tables
- Methodology — applies required standard, not shortcuts
- Scope discipline — excludes red herrings, stays within approved model
- Numerical accuracy — headline result within tolerance
- Disclosure — acknowledges assumptions, limitations, sensitivity
**Common critical fails:**
- Misses data quality error that cascades through computation
- Uses excluded variable or unapproved method
- Narrative contradicts own numbers
- No sensitivity analysis on key assumptions
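A verification script for this kind of memo might start with mechanical checks like the ones below; the transition matrix values are invented for illustration:

```python
import numpy as np

# Hypothetical 3-state rating transition matrix copied from the memo's table.
# A planted data-quality error would surface as a row not summing to 1.
transition = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.85, 0.10],
    [0.00, 0.00, 1.00],  # absorbing default state
])

row_sums = transition.sum(axis=1)
bad_rows = np.where(~np.isclose(row_sums, 1.0, atol=1e-6))[0]
assert bad_rows.size == 0, f"rows {bad_rows.tolist()} do not sum to 1"

# Headline check: 3-year cumulative default probability from state 0,
# recomputed independently and compared to the memo's number within tolerance.
pd_3yr = np.linalg.matrix_power(transition, 3)[0, 2]
```

Catching a bad row here matters because any error in the base matrix compounds through the matrix power into the headline figure.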
### Software Engineering (coding)

**Typical 5 areas:**
- Correctness — handles edge cases, passes test suite
- Architecture — appropriate patterns, separation of concerns
- Error handling — graceful failures, input validation
- Performance — no obvious O(n²) where O(n) is possible
- Readability — clear naming, documented decisions
**Common critical fails:**
- Silently wrong on edge cases
- Security vulnerability in obvious place
- Code doesn't compile/run
- Ignores stated constraints
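A hypothetical example of the "silently wrong on edge cases" failure: a small parser that validates its input and fails loudly instead of guessing. The function name and behavior are invented for illustration:

```python
def parse_percentage(text: str) -> float:
    """Parse '85%' or '0.85' into a fraction in [0, 1], failing loudly on bad input."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError(f"expected a non-empty string, got {text!r}")
    s = text.strip()
    value = float(s.rstrip("%"))  # raises ValueError on non-numeric input
    if s.endswith("%"):
        value /= 100.0
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"out of range: {text!r}")
    return value

# Edge cases a "silently wrong" implementation often mishandles:
assert parse_percentage("85%") == 0.85
assert parse_percentage(" 0.85 ") == 0.85
```

A graceful failure (a clear `ValueError`) is scoreable under the error-handling area; returning a plausible-looking wrong number is the critical fail.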
### Health / Epidemiology (analytical)

**Typical 5 areas:**
- Study design — correct comparison groups, confounding addressed
- Statistical method — appropriate test, assumptions checked
- Data issues — missing data handling, population definition
- Interpretation — causal language matches study design
- Clinical relevance — effect size context, not just p-values
**Common critical fails:**
- Causal claims from observational data
- Wrong statistical test for data structure
- Ignores confounders present in the data
- Misinterprets clinical significance
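The confounding point above can be checked numerically with direct age standardization. All weights, events, and person-years below are invented for illustration:

```python
# Direct age standardization: weight stratum-specific rates by a shared
# standard population so cohorts with different age mixes are comparable.
standard_weights = {"young": 0.6, "old": 0.4}
cohort_a = {"young": (5, 1000), "old": (40, 500)}   # stratum -> (events, person-years)
cohort_b = {"young": (8, 2000), "old": (40, 400)}

def age_adjusted_rate(cohort: dict) -> float:
    # Sum of (standard weight * stratum rate) over strata.
    return sum(w * cohort[s][0] / cohort[s][1] for s, w in standard_weights.items())

rate_a = age_adjusted_rate(cohort_a)  # ~0.035 events per person-year
rate_b = age_adjusted_rate(cohort_b)  # ~0.042
```

A response that compares crude rates across cohorts with different age structures is making the "ignores confounders" fail in miniature.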
## Trap selection by domain

### Data integrity traps (good for any domain)
- Row/column sum error → Finance (matrices), ML (confusion matrices), Health (population tables)
- Narrative contradicts data → Any memo-format prompt
- Legacy category label → Finance (ratings), Health (ICD codes)
- Red herring variable → Any task with "approved methodology"
- Unit mismatch → Finance (M vs K), Health (mg vs g), Engineering (metric vs imperial)
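The row/column sum trap can be planted and detected mechanically. A sketch using an invented two-region table:

```python
# Hypothetical population table with a planted row-sum error in region_b.
table = {
    "region_a": {"cases": 120, "controls": 480, "total": 600},
    "region_b": {"cases": 95, "controls": 400, "total": 500},  # 95 + 400 = 495
}

errors = [name for name, row in table.items()
          if row["cases"] + row["controls"] != row["total"]]
print(errors)  # ['region_b']
```

A strong response flags the inconsistency before computing anything downstream from the table.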
### Reasoning traps (good for computation-heavy tasks)
- Must transform before comparing → Finance (annualize PDs), Health (age-adjust rates), ML (normalize metrics)
- Wrong default threshold → ML (0.5 classification), Finance (materiality), Health (screening cutoff)
- Temporal ordering matters → ML (time series), Finance (vintage), Health (longitudinal)
- Compounding invisible error → Finance (matrix power), Engineering (tolerance stacking)
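The "must transform before comparing" trap, in its finance form, can be illustrated with invented numbers:

```python
# A 5-year cumulative PD cannot be compared to a 1-year benchmark by
# simple division; the annual rate must be survival-consistent.
cum_pd_5y = 0.15

annual_naive = cum_pd_5y / 5                  # ~0.0300, understates the annual rate
annual_geom = 1 - (1 - cum_pd_5y) ** (1 / 5)  # ~0.0320, geometric annualization

# Check: compounding the geometric rate reproduces the cumulative PD.
assert abs(1 - (1 - annual_geom) ** 5 - cum_pd_5y) < 1e-12
```

The same pattern applies to age-adjusting rates in health tasks or normalizing metrics in ML: the raw numbers look comparable but are not.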
### Scope traps (good for constrained tasks)
- Tempting extension excluded by instructions → Any regulated domain
- Ambiguous parameter needing assumption + disclosure → Finance (EIR), ML (missing hyperparam)
## Notebook workflow
1. Open `mercor_scorer.ipynb`.
2. Cell 1: Edit `BENCHMARK` and `RUBRIC` for your task.
3. Cell 2: Paste Gemini's response.
4. Cell 3: Run → see automated keyword scores.
5. Cell 4: Fill manual overrides for items needing judgment.
6. Cell 5: Fill hint results (after sending hints).
7. Cell 6: Fill observations → get paste-ready Field 5.
8. Cell 7: Export the JSON record.
Total time after Gemini testing: ~10 minutes to score and generate all form fields.
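The automated keyword pass in Cell 3 can be sketched as follows; the rubric schema and keywords here are hypothetical, not the notebook's actual internals:

```python
# Hypothetical rubric: each area maps to keywords whose presence in the
# response counts toward that area's automated score.
RUBRIC = {
    "data_handling": ["temporal split", "leakage"],
    "evaluation": ["validation set", "cost objective"],
}

def keyword_scores(response: str, rubric: dict) -> dict:
    # Fraction of each area's keywords found in the lowercased response.
    text = response.lower()
    return {area: sum(kw in text for kw in kws) / len(kws)
            for area, kws in rubric.items()}

scores = keyword_scores("We used a temporal split to avoid leakage.", RUBRIC)
print(scores)  # {'data_handling': 1.0, 'evaluation': 0.0}
```

Keyword hits are only a first pass, which is why Cell 4's manual overrides exist for items needing judgment.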