Project Proctor Fellow Playbook
Provenance: Ingested from Professional/Project Proctor Fellow Playbook.md (28 KB). Complete operational playbook for the Handshake AI Project Proctor fellowship program.
1. Overview
Goal: Design high-difficulty STEM problems to stress-test the advanced reasoning systems of frontier AI models.
Context: AI models currently excel at pattern recognition and surface-level reasoning but fail when multiple advanced concepts must be integrated, hidden assumptions exist, or step-by-step logical rigor is required.
Fellow Role: Create problems that cause deep reasoning failures, pushing models beyond memorization into structured, domain-specific thinking.
Benefits
- Timely feedback and support on reviews and tasks
- Direct coaching from STEM experts in office hours
- Future growth opportunities with the Handshake AI team
Expectations
- Complete onboarding
- Maintain a consistently strong approval rate
- Maintain quality and timeliness in responding to reviews
2. Incentive Structure
Weekly leaderboard tracks R1 Approvals:
- Top 3 fellows: $600 bonus
- Fellows ranked 4-6: $250 bonus
- Benchmark: top performers earn 5-10 approvals per week
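The bonus tiers above can be written as a small lookup. This is a minimal sketch: the function name `weekly_bonus` is hypothetical, and the playbook does not say how ties on the leaderboard are handled, so the sketch assumes each fellow has a unique 1-based rank.

```python
def weekly_bonus(rank: int) -> int:
    """Return the weekly bonus in USD for a 1-based leaderboard rank.

    Assumes ranks are unique; tie-breaking is not specified in the playbook.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank <= 3:
        return 600  # top 3 fellows
    if rank <= 6:
        return 250  # fellows ranked 4-6
    return 0        # no bonus outside the top 6
```

For example, `weekly_bonus(2)` returns 600 and `weekly_bonus(7)` returns 0.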
3. Promotion / Demotion Criteria
Progression: New Fellow -> Throttled -> Fellow (Avg SQS >= 3 over >= 3 tasks) -> Star Fellow (Avg SQS >= 3.2)
Submission Quality Score (SQS):
| Score | Level | Description |
|-------|-------|-------------|
| 5 | Exceptional | Fully aligned, no edits needed |
| 4 | Great | Meets core standards, minor clarity/formatting issues |
| 3 | Acceptable (Min Passing) | Valid and salvageable, requires substantial edits |
| 2 | Failure | Critical issue requiring rework |
| 1 | Invalid | Does not follow guidelines, fundamentally flawed |
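The SQS thresholds in the progression can be checked with a short helper. This is a sketch under stated assumptions: `fellow_tier` is a hypothetical name, the 3-task minimum is assumed to apply to Star Fellow as well as Fellow, and because the playbook does not define the criteria for the New Fellow and Throttled stages, anything below the Fellow threshold is reported with a generic label.

```python
from statistics import mean


def fellow_tier(sqs_scores: list[float]) -> str:
    """Map a list of per-task SQS values (1-5) to a tier label."""
    # Assumption: the ">= 3 tasks" minimum also gates Star Fellow.
    if len(sqs_scores) < 3:
        return "Below Fellow threshold"
    avg = mean(sqs_scores)
    if avg >= 3.2:
        return "Star Fellow"
    if avg >= 3:
        return "Fellow"
    # New Fellow / Throttled criteria are not defined in the playbook.
    return "Below Fellow threshold"
```

With scores `[4, 3, 3]` the average is about 3.33, so the helper returns "Star Fellow"; with `[3, 3, 3]` it returns "Fellow".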
4. Logistics
- Verify Project Proctor (Production) on annotations platform dashboard
- Complete onboarding: link Stripe, accept terms, pass assessment, join HAI Slack
- Submit first task; monitored for quality and approval rate
5. Task Workflow (15 Steps)
- Pick domain + confirm
- Write clear prompt with one verifiable answer that challenges all models
- Ensure Model A is stumped at least once (valid stump)
- Ensure Model B is stumped at least once (valid stump)
- Submit exact final answer (concise, no explanation/units)
- Pass@K auto-evaluation (both models must have at least one failure; at least one correct response)
- Record number of failed model responses
- Provide failure rationale for Model A
- Provide failure rationale for Model B
- State final answer format
- Provide subdomain, education level, difficulty level
- Provide complete step-by-step solution
- Write rubrics (2-7 items, total = 7 points)
- If a Quality Check flag is raised, provide a justification
- Submit
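The Pass@K condition and the failure count from the workflow above can be sketched as follows. The function names are hypothetical, and the sketch reads "at least one correct response" as at least one correct response across both models combined; the playbook does not say whether this applies per model, so treat that as an assumption.

```python
def passes_stump_check(model_a: list[bool], model_b: list[bool]) -> bool:
    """Each list holds one entry per sampled response: True if correct.

    Only valid stumps (reasoning failures) should be recorded as False;
    arithmetic or formatting slips do not count as failures.
    """
    if not model_a or not model_b:
        return False  # nothing sampled for one of the models
    each_model_failed = (not all(model_a)) and (not all(model_b))
    # Assumption: one correct response from either model is enough.
    any_correct = any(model_a) or any(model_b)
    return each_model_failed and any_correct


def failed_count(model_a: list[bool], model_b: list[bool]) -> int:
    """Number of failed responses across both models."""
    return sum(not ok for ok in model_a) + sum(not ok for ok in model_b)
```

A prompt that stumps both models on every sample fails this check, since no correct response exists to confirm the problem is solvable.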
6. Tasking Instructions
Prompt Quality Standards
Difficulty: Minimum High School Olympiad; target Graduate/PhD-level. Must require deep reasoning, not routine computation.
Originality: 100% original. Not from textbooks, papers, competitions, or websites. Not searchable online.
Self-Containment: Unambiguous and fully solvable without external resources. Do not assume the model knows any specific paper.
Formatting: Text-based, not multiple choice. Markdown + KaTeX for math. Images only for Electrical Engineering circuit diagrams.
Model Failure Validation
Valid stumps: Fundamental conceptual misunderstandings, incorrect theorem/principle application, flawed logical reasoning, missing critical steps.
Does not count: Rounding/arithmetic errors, formatting issues, small computational slips with otherwise correct reasoning.
Reasoning Types Assessed
Deductive, Inductive, Temporal, Spatial, Causal, Comparative analysis, Abstract, Pattern recognition, Statistical, Abductive, Hypothetical.
7. Solution Requirements
Step-by-step solution must:
- Write every reasoning step explicitly
- Show all intermediate calculations
- Justify all formulas
- State all assumptions
- Be independently verifiable by reviewer
8. Rubric Standards
2-7 criteria, total weight = 7 points. Each criterion:
- Atomic (one check only)
- Self-contained
- Verb-led (Derives, States, Extracts, Identifies, Calculates, etc.)
- Includes ground-truth values and tolerance ranges
- Last criterion = final answer check
Five Key Principles: Verb-Led, Atomic, Self-Contained, Grading Guidance, Final Answer Last.
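The mechanical parts of these rubric rules (criterion count and total weight) can be checked before submission. This is a minimal sketch: `Criterion` and `rubric_problems` are hypothetical names, and weights are assumed to be integers. Whether a criterion is atomic, self-contained, verb-led, or a final answer check still needs a human read.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    text: str    # verb-led statement, e.g. "Calculates ..."
    points: int  # assumed integer weight


def rubric_problems(rubric: list[Criterion]) -> list[str]:
    """Return a list of rule violations; an empty list means none found."""
    problems = []
    if not 2 <= len(rubric) <= 7:
        problems.append(f"expected 2-7 criteria, found {len(rubric)}")
    total = sum(c.points for c in rubric)
    if total != 7:
        problems.append(f"weights must total 7, found {total}")
    for i, c in enumerate(rubric, start=1):
        if c.points <= 0:
            problems.append(f"criterion {i} has non-positive weight")
        if not c.text.strip():
            problems.append(f"criterion {i} has no text")
    return problems
```

A two-item rubric weighted 3 and 4 passes; a single 7-point criterion is rejected for having too few criteria.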
9. Domains
Physics/Astronomy, Chemistry, Biology, Engineering.
10. LaTeX Standards
- All numerical values and variables in `$` delimiters
- Units inside `\text{}` to prevent italicization
- Numbers given to at least the precision requested for the answer
- For numerical answers, prompt must specify units
- For equation answers, all required variables must be specified
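As an illustration of the formatting rules above, a prompt fragment might look like the following. The physical setup is made up purely to show the notation and is far below the required difficulty level.

```latex
A particle of mass $m = 2.50\,\text{kg}$ moves at speed $v = 3.00\,\text{m/s}$.
Find its kinetic energy $K$ in $\text{J}$ to three significant figures.
```

Every value and variable sits inside `$` delimiters, units are wrapped in `\text{}`, the inputs carry three significant figures to match the requested precision, and the prompt names the unit of the answer.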
11. Pre-Submission Checklist
Prompt Quality: clarity, self-containment, difficulty (Graduate/PhD), originality, proper LaTeX formatting.
Model Failure: At least one failure from each model; reasoning error (not arithmetic).
Failure Rationale: Exact reasoning error identified, explained, referenced to specific step.
Solution: Complete, all steps shown, formulas justified, assumptions stated, final answer labeled.
Hints: Exactly 3 progressive hints; independently useful; do not reference each other.
Rubric: 2-7 criteria; total weight = 7; atomic; self-contained; verb-led; includes tolerances.
12. Critical Fail Conditions
Any of the following immediately invalidates a pipeline submission:
- Random train/test split on time-series data
- Threshold tuned on test set (data leakage)
- Accuracy used as primary optimization metric
- Future data in feature calculations (look-ahead bias)
- Asymmetric business cost ignored or never computed
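Three of these conditions (random splits, test-set threshold tuning, ignored asymmetric cost) can be avoided with a chronological split and a cost-based threshold chosen on a validation set. The sketch below uses only the standard library; all function names, the row layout (timestamp first), and the cost values in the usage note are hypothetical.

```python
def chronological_split(rows, n_val, n_test):
    """Split rows (tuples with the timestamp first) into train/val/test.

    The newest n_test rows form the test set and the n_val rows before
    them form the validation set, so no future data reaches training.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    n = len(ordered)
    if n_val + n_test >= n:
        raise ValueError("validation and test sets leave no training data")
    train_end = n - n_val - n_test
    val_end = n - n_test
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Business cost of classifying score >= threshold as positive."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and not label:
            cost += fp_cost  # false positive
        elif not predicted and label:
            cost += fn_cost  # false negative
    return cost


def pick_threshold(val_scores, val_labels, candidates, fp_cost, fn_cost):
    """Choose the threshold on validation data only, never on the test set."""
    return min(
        candidates,
        key=lambda t: total_cost(val_scores, val_labels, t, fp_cost, fn_cost),
    )
```

For instance, with validation scores `[0.2, 0.4, 0.6, 0.8]`, labels `[0, 1, 0, 1]`, candidates `[0.3, 0.5, 0.7]`, and a false negative costing ten times a false positive, `pick_threshold` selects 0.3, because missing a positive is the expensive error.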