Project Proctor Fellow Playbook




1. Overview

Goal: Design high-difficulty STEM problems to stress-test the advanced reasoning systems of frontier AI models.

Context: AI models currently excel at pattern recognition and surface-level reasoning but fail when multiple advanced concepts must be integrated, hidden assumptions exist, or step-by-step logical rigor is required.

Fellow Role: Create problems that cause deep reasoning failures, pushing models beyond memorization into structured, domain-specific thinking.

Benefits

  • Timely feedback and support on reviews and tasks
  • Direct coaching from STEM experts in office hours
  • Future growth opportunities with the Handshake AI team

Expectations

  • Complete onboarding
  • Maintain consistently strong approval rate
  • Maintain quality and timeliness in responding to reviews

2. Incentive Structure

Weekly leaderboard tracks R1 Approvals:

  • Top 3 fellows: $600 bonus
  • Fellows ranked 4-6: $250 bonus
  • Top performers: 5-10 approvals per week

3. Promotion / Demotion Criteria

Progression: New Fellow -> Throttled -> Fellow (Avg SQS >= 3 over >= 3 tasks) -> Star Fellow (Avg SQS >= 3.2)

Submission Quality Score (SQS):

| Score | Level | Description |
|-------|-------|-------------|
| 5 | Exceptional | Fully aligned, no edits needed |
| 4 | Great | Meets core standards, minor clarity/formatting issues |
| 3 | Acceptable (Min Passing) | Valid and salvageable, requires substantial edits |
| 2 | Failure | Critical issue requiring rework |
| 1 | Invalid | Does not follow guidelines, fundamentally flawed |

4. Logistics

  1. Verify that Project Proctor (Production) appears on the annotations platform dashboard
  2. Complete onboarding: link Stripe, accept terms, pass assessment, join HAI Slack
  3. Submit first task; monitored for quality and approval rate

5. Task Workflow (15 Steps)

  1. Pick domain + confirm
  2. Write clear prompt with one verifiable answer that challenges all models
  3. Ensure Model A stumped at least once (valid stump)
  4. Ensure Model B stumped at least once (valid stump)
  5. Submit exact final answer (concise, no explanation/units)
  6. Pass@K auto-evaluation (both models must have at least one failure; at least one correct response)
  7. Record number of failed model responses
  8. Provide failure rationale for Model A
  9. Provide failure rationale for Model B
  10. State final answer format
  11. Provide subdomain, education level, difficulty level
  12. Provide complete step-by-step solution
  13. Write rubrics (2-7 items, total = 7 points)
  14. If Quality Check flag: provide justification
  15. Submit
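The Pass@K gate in step 6 can be sketched as a simple check. This is a hypothetical illustration assuming each model's responses are recorded as booleans (True = correct, False = failed); the function name and data shape are illustrative, not the platform's actual API.

```python
# Illustrative Pass@K gate check: both models must fail at least once
# (each has a valid stump), and at least one response overall must be
# correct (confirming the problem is solvable as stated).

def passes_pass_at_k_gate(model_a_correct: list[bool],
                          model_b_correct: list[bool]) -> bool:
    """Return True if the submission clears the auto-evaluation gate."""
    both_stumped = (not all(model_a_correct)) and (not all(model_b_correct))
    any_correct = any(model_a_correct) or any(model_b_correct)
    return both_stumped and any_correct
```

For example, a task where Model A answers [correct, wrong, correct] and Model B answers [wrong, wrong, correct] clears the gate; a task neither model ever solves does not.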

6. Tasking Instructions

Prompt Quality Standards

Difficulty: Minimum High School Olympiad; target Graduate/PhD-level. Must require deep reasoning, not routine computation.

Originality: 100% original. Not from textbooks, papers, competitions, or websites. Not searchable online.

Self-Containment: Unambiguous and fully solvable without external resources. Do not assume model knows any specific paper.

Formatting: Text-based, not multiple choice. Markdown + KaTeX for math. Images only for Electrical Engineering circuit diagrams.

Model Failure Validation

Valid stumps: Fundamental conceptual misunderstandings, incorrect theorem/principle application, flawed logical reasoning, missing critical steps.

Does not count: Rounding/arithmetic errors, formatting issues, small computational slips with otherwise correct reasoning.

Reasoning Types Assessed

Deductive, Inductive, Temporal, Spatial, Causal, Comparative analysis, Abstract, Pattern recognition, Statistical, Abductive, Hypothetical.

7. Solution Requirements

Step-by-step solution must:

  • Write every reasoning step explicitly
  • Show all intermediate calculations
  • Justify all formulas
  • State all assumptions
  • Be independently verifiable by reviewer

8. Rubric Standards

2-7 criteria, total weight = 7 points. Each criterion:

  • Atomic (one check only)
  • Self-contained
  • Verb-led (Derives, States, Extracts, Identifies, Calculates, etc.)
  • Includes ground-truth values and tolerance ranges
  • Last criterion = final answer check

Five Key Principles: Verb-Led, Atomic, Self-Contained, Grading Guidance, Final Answer Last.
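The structural constraints above (2-7 criteria, weights totaling 7, final-answer check last) lend themselves to a quick self-check before submission. This is a minimal sketch using a made-up field layout, not the platform's schema.

```python
# Hypothetical rubric self-check against the structural standards:
# 2-7 criteria, weights summing to 7 points, final answer check last.

def validate_rubric(criteria: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the rubric passes."""
    problems = []
    if not 2 <= len(criteria) <= 7:
        problems.append("rubric must have 2-7 criteria")
    if sum(c["weight"] for c in criteria) != 7:
        problems.append("criterion weights must total 7 points")
    if criteria and not criteria[-1].get("is_final_answer_check", False):
        problems.append("last criterion must be the final answer check")
    return problems
```

Note this only checks structure; atomicity, self-containment, and verb-led phrasing still require human review.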

9. Domains

Physics/Astronomy, Chemistry, Biology, Engineering.

10. LaTeX Standards

  • All numerical values and variables in $ delimiters
  • Units inside \text{} to prevent italicization
  • Numbers given to at least the precision requested for the answer
  • For numerical answers, prompt must specify units
  • For equation answers, all required variables must be specified
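A minimal illustration of these conventions in KaTeX-compatible markup (the physical values are made up for the example):

```latex
A particle of mass $m = 2.50\,\text{kg}$ moves at
$v = 3.00 \times 10^{2}\,\text{m/s}$.
Compute the kinetic energy $E_k = \tfrac{1}{2} m v^2$,
reported in $\text{J}$ to three significant figures.
```

Note that every number and variable sits inside `$` delimiters, units are wrapped in `\text{}` so they render upright, and the prompt states both the required units and the requested precision.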

11. Pre-Submission Checklist

Prompt Quality: clarity, self-containment, difficulty (Graduate/PhD), originality, proper LaTeX formatting.

Model Failure: At least one failure from each model; reasoning error (not arithmetic).

Failure Rationale: Exact reasoning error identified, explained, referenced to specific step.

Solution: Complete, all steps shown, formulas justified, assumptions stated, final answer labeled.

Hints: Exactly 3 progressive hints; independently useful; do not reference each other.

Rubric: 2-7 criteria; total weight = 7; atomic; self-contained; verb-led; includes tolerances.

12. Critical Fail Conditions

Any of the following immediately invalidates a pipeline submission:

  • Random train/test split on time-series data
  • Threshold tuned on test set (data leakage)
  • Accuracy used as primary optimization metric
  • Future data in feature calculations (look-ahead bias)
  • Asymmetric business cost ignored or never computed
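The first and fourth conditions (random splits and look-ahead bias on time-series data) are avoided by splitting chronologically. The sketch below is illustrative only, using plain dictionaries with an assumed `timestamp` key rather than any real pipeline code.

```python
# Illustrative chronological train/test split for time-series data:
# the test set is strictly later than the training set, so no future
# information leaks into training, and nothing is shuffled.

def chronological_split(records: list[dict], test_fraction: float = 0.2):
    """Split time-ordered records at a cut point; returns (train, test)."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]
```

Any thresholds or hyperparameters should then be tuned on the training portion only; touching the test set during tuning is exactly the data-leakage condition listed above.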