Project Proctor Fellow Playbook




1. Overview

Goal: Design high-difficulty STEM problems to stress-test the advanced reasoning systems of frontier AI models.

Context: AI models currently excel at pattern recognition and surface-level reasoning but fail when multiple advanced concepts must be integrated, hidden assumptions exist, or step-by-step logical rigor is required.

Fellow Role: Create problems that cause deep reasoning failures, pushing models beyond memorization into structured, domain-specific thinking.

Benefits

  • Timely feedback and support on reviews and tasks
  • Direct coaching from STEM experts in office hours
  • Future growth opportunities with the Handshake AI team

Expectations

  • Complete onboarding
  • Maintain consistently strong approval rate
  • Maintain quality and timeliness in responding to reviews

2. Incentive Structure

Weekly leaderboard tracks R1 Approvals:

  • Top 3 fellows: $600 bonus
  • Fellows ranked 4-6: $250 bonus
  • Top performers: 5-10 approvals per week

3. Promotion / Demotion Criteria

Progression: New Fellow -> Throttled -> Fellow (Avg SQS >= 3 over >= 3 tasks) -> Star Fellow (Avg SQS >= 3.2)

Submission Quality Score (SQS):

| Score | Level | Description |
|-------|-------|-------------|
| 5 | Exceptional | Fully aligned, no edits needed |
| 4 | Great | Meets core standards, minor clarity/formatting issues |
| 3 | Acceptable (Min Passing) | Valid and salvageable, requires substantial edits |
| 2 | Failure | Critical issue requiring rework |
| 1 | Invalid | Does not follow guidelines, fundamentally flawed |

4. Logistics

  1. Verify that Project Proctor (Production) appears on the annotations platform dashboard
  2. Complete onboarding: link Stripe, accept terms, pass assessment, join HAI Slack
  3. Submit first task; monitored for quality and approval rate

5. Task Workflow (15 Steps)

  1. Pick domain + confirm
  2. Write clear prompt with one verifiable answer that challenges all models
  3. Ensure Model A stumped at least once (valid stump)
  4. Ensure Model B stumped at least once (valid stump)
  5. Submit exact final answer (concise, no explanation/units)
  6. Pass@K auto-evaluation (both models must have at least one failure; at least one correct response)
  7. Record number of failed model responses
  8. Provide failure rationale for Model A
  9. Provide failure rationale for Model B
  10. State final answer format
  11. Provide subdomain, education level, difficulty level
  12. Provide complete step-by-step solution
  13. Write rubrics (2-7 items, total = 7 points)
  14. If Quality Check flag: provide justification
  15. Submit
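The Pass@K gate in step 6 can be sketched as a simple check. This is a hypothetical illustration assuming each model's responses are recorded as booleans (True = correct, False = failed); the function name and data shape are illustrative, not the platform's actual API.

```python
# Illustrative Pass@K gate check: both models must fail at least once
# (each has a valid stump), and at least one response overall must be
# correct (confirming the problem is solvable as stated).

def passes_pass_at_k_gate(model_a_correct: list[bool],
                          model_b_correct: list[bool]) -> bool:
    """Return True if the submission clears the auto-evaluation gate."""
    both_stumped = (not all(model_a_correct)) and (not all(model_b_correct))
    any_correct = any(model_a_correct) or any(model_b_correct)
    return both_stumped and any_correct
```

For example, a task where Model A answers [correct, wrong, correct] and Model B answers [wrong, wrong, correct] clears the gate; a task neither model ever solves does not.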

6. Tasking Instructions

Prompt Quality Standards

Difficulty: Minimum High School Olympiad; target Graduate/PhD-level. Must require deep reasoning, not routine computation.

Originality: 100% original. Not from textbooks, papers, competitions, or websites. Not searchable online.

Self-Containment: Unambiguous and fully solvable without external resources. Do not assume model knows any specific paper.

Formatting: Text-based, not multiple choice. Markdown + KaTeX for math. Images only for Electrical Engineering circuit diagrams.

Model Failure Validation

Valid stumps: Fundamental conceptual misunderstandings, incorrect theorem/principle application, flawed logical reasoning, missing critical steps.

Does not count: Rounding/arithmetic errors, formatting issues, small computational slips with otherwise correct reasoning.

Reasoning Types Assessed

Deductive, Inductive, Temporal, Spatial, Causal, Comparative analysis, Abstract, Pattern recognition, Statistical, Abductive, Hypothetical.

7. Solution Requirements

Step-by-step solution must:

  • Write every reasoning step explicitly
  • Show all intermediate calculations
  • Justify all formulas
  • State all assumptions
  • Be independently verifiable by reviewer

8. Rubric Standards

2-7 criteria, total weight = 7 points. Each criterion:

  • Atomic (one check only)
  • Self-contained
  • Verb-led (Derives, States, Extracts, Identifies, Calculates, etc.)
  • Includes ground-truth values and tolerance ranges
  • Last criterion = final answer check

Five Key Principles: Verb-Led, Atomic, Self-Contained, Grading Guidance, Final Answer Last.
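The structural constraints above (2-7 criteria, weights totaling 7, final-answer check last) lend themselves to a quick self-check before submission. This is a minimal sketch using a made-up field layout, not the platform's schema.

```python
# Hypothetical rubric self-check against the structural standards:
# 2-7 criteria, weights summing to 7 points, final answer check last.

def validate_rubric(criteria: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the rubric passes."""
    problems = []
    if not 2 <= len(criteria) <= 7:
        problems.append("rubric must have 2-7 criteria")
    if sum(c["weight"] for c in criteria) != 7:
        problems.append("criterion weights must total 7 points")
    if criteria and not criteria[-1].get("is_final_answer_check", False):
        problems.append("last criterion must be the final answer check")
    return problems
```

Note this only checks structure; atomicity, self-containment, and verb-led phrasing still require human review.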

9. Domains

Physics/Astronomy, Chemistry, Biology, Engineering.

10. LaTeX Standards

  • All numerical values and variables in $ delimiters
  • Units inside \text{} to prevent italicization
  • Numbers given to at least the precision requested for the answer
  • For numerical answers, prompt must specify units
  • For equation answers, all required variables must be specified
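A minimal illustration of these conventions in KaTeX-compatible markup (the physical values are made up for the example):

```latex
A particle of mass $m = 2.50\,\text{kg}$ moves at
$v = 3.00 \times 10^{2}\,\text{m/s}$.
Compute the kinetic energy $E_k = \tfrac{1}{2} m v^2$,
reported in $\text{J}$ to three significant figures.
```

Note that every number and variable sits inside `$` delimiters, units are wrapped in `\text{}` so they render upright, and the prompt states both the required units and the requested precision.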

11. Pre-Submission Checklist

Prompt Quality: clarity, self-containment, difficulty (Graduate/PhD), originality, proper LaTeX formatting.

Model Failure: At least one failure from each model; reasoning error (not arithmetic).

Failure Rationale: Exact reasoning error identified, explained, referenced to specific step.

Solution: Complete, all steps shown, formulas justified, assumptions stated, final answer labeled.

Hints: Exactly 3 progressive hints; independently useful; do not reference each other.

Rubric: 2-7 criteria; total weight = 7; atomic; self-contained; verb-led; includes tolerances.

12. Critical Fail Conditions

Any of the following immediately invalidates a pipeline submission:

  • Random train/test split on time-series data
  • Threshold tuned on test set (data leakage)
  • Accuracy used as primary optimization metric
  • Future data in feature calculations (look-ahead bias)
  • Asymmetric business cost ignored or never computed
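The first and fourth conditions (random splits and look-ahead bias on time-series data) are avoided by splitting chronologically. The sketch below is illustrative only, using plain dictionaries with an assumed `timestamp` key rather than any real pipeline code.

```python
# Illustrative chronological train/test split for time-series data:
# the test set is strictly later than the training set, so no future
# information leaks into training, and nothing is shuffled.

def chronological_split(records: list[dict], test_fraction: float = 0.2):
    """Split time-ordered records at a cut point; returns (train, test)."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]
```

Any thresholds or hyperparameters should then be tuned on the training portion only; touching the test set during tuning is exactly the data-leakage condition listed above.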