Proctor Task Quick Reference


Template Quick Reference

Pick your task type → use the matching template → adapt the rubric pattern.

Which template?

| Task type | Template | Prompt style | Golden solution format |
|-----------|----------|--------------|------------------------|
| ML / data science | template_coding_task.md | Informal + data generator | Runnable Python |
| Software engineering | template_coding_task.md | Bug report or feature spec | Runnable code + tests |
| Finance / regulatory | template_analytical_task.md | Formal memo + data tables | Narrative + verification script |
| Health / epidemiology | template_analytical_task.md | Clinical brief + cohort data | Narrative + computation |
| Engineering / physics | Either | Depends on whether code is the deliverable | Simulation or analysis |

Rubric patterns by task type

ML / Data Science (coding)

Typical 5 areas:

  1. Data handling — leakage prevention, correct labels, proper splits
  2. Feature engineering — backward-looking, grouped, reasonable
  3. Model training — imbalance handling, appropriate method
  4. Evaluation protocol — validation vs test, right metrics, cost objective
  5. Completeness — runnable code, explained decisions
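Areas 1 and 2 can be illustrated with a minimal sketch (the transaction data here is invented for illustration): a strictly temporal split, and a rolling feature shifted so each row sees only earlier values.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction data -- the date column drives the split.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=100, freq="D"),
    "amount": rng.normal(100, 20, 100),
    "label": rng.integers(0, 2, 100),
})

# Area 2: backward-looking feature -- the rolling mean is shifted one row
# so each observation sees only strictly earlier amounts.
df["amount_7d_mean"] = df["amount"].rolling(7, min_periods=1).mean().shift(1)

# Area 1: temporal split -- train strictly precedes test, no shuffling.
cutoff = df["date"].iloc[int(len(df) * 0.8)]
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
```

A grouped split (e.g., by customer ID) follows the same principle: the grouping key, not the row index, decides which side of the boundary a record lands on.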

Common critical fails:

  • Random split on temporal data
  • Threshold/hyperparameter tuned on test
  • Accuracy as primary metric on imbalanced data
  • Future information in features
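The second fail above has a simple correct counterpart, sketched here with made-up scores and labels: sweep the threshold on validation data only, then touch the test set exactly once with the frozen value.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical model scores and binary labels for validation and test sets.
val_scores, val_y = rng.random(200), rng.integers(0, 2, 200)
test_scores, test_y = rng.random(200), rng.integers(0, 2, 200)

def f1_at(threshold, scores, y):
    """F1 score of the rule `scores >= threshold`."""
    pred = scores >= threshold
    tp = np.sum(pred & (y == 1))
    if tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / (y == 1).sum()
    return 2 * precision * recall / (precision + recall)

# Correct: tune the threshold on VALIDATION data only...
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_at(t, val_scores, val_y))

# ...then evaluate on the test set exactly once, threshold frozen.
test_f1 = f1_at(best_t, test_scores, test_y)
```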

Finance / Regulatory (analytical)

Typical 5 areas:

  1. Data quality — catches planted errors in tables
  2. Methodology — applies required standard, not shortcuts
  3. Scope discipline — excludes red herrings, stays within approved model
  4. Numerical accuracy — headline result within tolerance
  5. Disclosure — acknowledges assumptions, limitations, sensitivity
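Area 1 is what the verification script in the golden solution typically checks first. A minimal sketch, using an invented rating transition matrix with a planted row-sum error:

```python
import numpy as np

# Hypothetical rating transition matrix with a planted data-quality error:
# the "B" row sums to 1.02 instead of 1.00.
matrix = np.array([
    [0.90, 0.08, 0.02],   # A
    [0.05, 0.85, 0.12],   # B  <- planted error: sums to 1.02
    [0.01, 0.09, 0.90],   # C
])

row_sums = matrix.sum(axis=1)
bad_rows = np.where(~np.isclose(row_sums, 1.0, atol=1e-6))[0]
# Flag the bad row BEFORE any computation uses the matrix --
# otherwise the error cascades through everything downstream.
```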

Common critical fails:

  • Misses data quality error that cascades through computation
  • Uses excluded variable or unapproved method
  • Narrative contradicts own numbers
  • No sensitivity analysis on key assumptions
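The last fail is cheap to avoid: even a one-parameter sweep counts as sensitivity analysis. A minimal sketch with invented numbers, varying a discount-rate assumption around its base case:

```python
# Hypothetical present-value calculation; the discount rate is the
# key assumption being stressed.
cashflow, years = 1000.0, 5

def pv(rate):
    return cashflow / (1 + rate) ** years

base_rate = 0.05
baseline = pv(base_rate)

# Deviation of the headline number under +/- 1pp on the rate.
sensitivity = {r: round(pv(r) - baseline, 2) for r in (0.04, 0.05, 0.06)}
```

Disclosing the table alongside the headline result covers both the "no sensitivity analysis" fail and part of the Disclosure rubric area.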

Software Engineering (coding)

Typical 5 areas:

  1. Correctness — handles edge cases, passes test suite
  2. Architecture — appropriate patterns, separation of concerns
  3. Error handling — graceful failures, input validation
  4. Performance — no obvious O(n²) where O(n) is possible
  5. Readability — clear naming, documented decisions

Common critical fails:

  • Silently wrong on edge cases
  • Security vulnerability in obvious place
  • Code doesn't compile/run
  • Ignores stated constraints
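"Silently wrong on edge cases" is usually a boundary that was never exercised. A minimal sketch of the pattern graders look for, using an invented `mean` helper: the edge case is handled explicitly, and the test suite hits the boundary, not just the happy path.

```python
def mean(values):
    # Explicit edge case: fail loudly rather than silently returning 0.
    if not values:
        raise ValueError("mean() of empty sequence")
    return sum(values) / len(values)

# Tests exercise the boundary as well as the happy path.
assert mean([2, 4]) == 3.0
try:
    mean([])
    raise AssertionError("empty input should raise")
except ValueError:
    pass
```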

Health / Epidemiology (analytical)

Typical 5 areas:

  1. Study design — correct comparison groups, confounding addressed
  2. Statistical method — appropriate test, assumptions checked
  3. Data issues — missing data handling, population definition
  4. Interpretation — causal language matches study design
  5. Clinical relevance — effect size context, not just p-values
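Area 5 is about reporting magnitudes, not just significance. A minimal sketch with an invented two-group cohort, computing the effect sizes a clinical reader actually needs:

```python
# Hypothetical cohort counts (events / group size).
exposed_events, exposed_n = 30, 1000
control_events, control_n = 20, 1000

risk_exposed = exposed_events / exposed_n    # 0.030
risk_control = control_events / control_n    # 0.020
risk_ratio = risk_exposed / risk_control     # 1.5
risk_diff = risk_exposed - risk_control      # 10 extra cases per 1000
nnh = 1 / risk_diff                          # ~100 exposed per extra case
```

A p-value alone would hide that the absolute difference is 1 percentage point; the risk difference and number-needed-to-harm put the relative risk of 1.5 in context.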

Common critical fails:

  • Causal claims from observational data
  • Wrong statistical test for data structure
  • Ignores confounders present in the data
  • Misinterprets clinical significance

Trap selection by domain

Data integrity traps (good for any domain)

  • Row/column sum error → Finance (matrices), ML (confusion matrices), Health (population tables)
  • Narrative contradicts data → Any memo-format prompt
  • Legacy category label → Finance (ratings), Health (ICD codes)
  • Red herring variable → Any task with "approved methodology"
  • Unit mismatch → Finance (M vs K), Health (mg vs g), Engineering (metric vs imperial)
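The row/column sum trap in its ML form can be sketched as a consistency check on a reported confusion matrix (the numbers here are invented, with the error planted deliberately):

```python
# Hypothetical planted trap: confusion-matrix cells that do not add up
# to the stated sample size.
reported_n = 500
confusion = {"tp": 120, "fp": 30, "fn": 45, "tn": 290}   # sums to 485

cell_total = sum(confusion.values())
consistent = cell_total == reported_n   # False -> planted error caught
```

The same three-line check adapts to finance matrices (rows sum to 1) and health population tables (strata sum to the cohort size).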

Reasoning traps (good for computation-heavy tasks)

  • Must transform before comparing → Finance (annualize PDs), Health (age-adjust rates), ML (normalize metrics)
  • Wrong default threshold → ML (0.5 classification), Finance (materiality), Health (screening cutoff)
  • Temporal ordering matters → ML (time series), Finance (vintage), Health (longitudinal)
  • Compounding invisible error → Finance (matrix power), Engineering (tolerance stacking)
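The "must transform before comparing" trap in its finance form: a monthly PD cannot be compared to an annual one until it is annualized, and the naive multiply-by-12 shortcut is wrong. A minimal sketch with an invented PD:

```python
# Hypothetical monthly probability of default.
monthly_pd = 0.005

# Correct annualization: survival probabilities compound, so
# annual PD = 1 - (1 - monthly)^12, not monthly * 12.
annual_pd = 1 - (1 - monthly_pd) ** 12   # ~0.0584
naive_pd = monthly_pd * 12               # 0.060 -- overstates the risk
```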

Scope traps (good for constrained tasks)

  • Tempting extension excluded by instructions → Any regulated domain
  • Ambiguous parameter needing assumption + disclosure → Finance (EIR), ML (missing hyperparam)

Notebook workflow

1. Open mercor_scorer.ipynb
2. Cell 1: Edit BENCHMARK and RUBRIC for your task
3. Cell 2: Paste Gemini's response
4. Cell 3: Run → see automated keyword scores
5. Cell 4: Fill manual overrides for items needing judgment
6. Cell 5: Fill hint results (after sending hints)
7. Cell 6: Fill observations → get paste-ready Field 5
8. Cell 7: Export JSON record
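The step 8 export might look roughly like the sketch below. The field names here are hypothetical stand-ins; the actual notebook's record schema may differ.

```python
import json

# Hypothetical record structure for the Cell 7 JSON export.
record = {
    "task_id": "example-001",
    "scores": {"data_handling": 2, "evaluation": 1},
    "hint_results": ["passed", "failed"],
    "observations": "Model tuned threshold on test set.",
}
payload = json.dumps(record, indent=2)
```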

Total time after Gemini testing: ~10 minutes to score and generate all form fields.