Project Proctor Fellow Onboarding Playbook



1. Overview

Welcome to Project Proctor!

Purpose

Goal: Design high-difficulty STEM problems that genuinely stress-test the advanced reasoning capabilities of frontier AI models.

Context: AI models currently excel at pattern recognition, formula memorization, and surface-level reasoning. However, they often fail on problems beyond the scope of their training. In particular, they fail when:

  • Multiple advanced concepts must be integrated
  • There are hidden assumptions
  • Logical rigor is required step-by-step

Your Role: As a fellow, you will create problems that cause deep reasoning failures, helping push the next generation of models beyond memorization into structured, domain-specific thinking.

Benefits

  • Timely feedback and support on your reviews and tasks
  • Direct coaching from STEM experts in office hours
  • Future opportunities to grow with the Handshake AI team as a star fellow or reviewer

Expectations

  • Complete onboarding
  • Maintain a consistently strong approval rate
  • Maintain consistent quality and timeliness in responding to reviews

2. Leaderboard

Weekly Incentive:

  • Fellows in top 3: Get a $600 bonus
  • Fellows in top 4–6: Get a $250 bonus

The leaderboard tracks R1 approvals per week (e.g., the week of 3/9). Top performers typically earn 5–10 approvals per week.


3. Promotion / Demotion Criteria

Fellow Role and Promotion/Demotion Flow:

  • Promotion path: New Fellow → Throttled → Fellow (after Avg SQS ≥ 3 over ≥ 3 tasks) → Star Fellow (Avg SQS ≥ 3.2)
  • Demotion path: Fellow → Fellow-Throttled → Fellow on hold (SQS 1)

Submission Quality Score (SQS) Legend

| Score | Level | Description |
|-------|-------|-------------|
| 5 | Exceptional | Fully aligned with all standards. No meaningful edits needed. |
| 4 | Great | Meets core quality standards with all valid components. May have minor clarity/formatting issues. Only light edits needed. |
| 3 | Acceptable (Minimum Passing) | Core prompt is valid and salvageable, but other parts contain meaningful issues. Requires substantial edits or restructuring. |
| 2 | Failure (Revision Required) | Contains at least one critical, prompt/answer-related issue requiring rework. Must be sent back for revision. |
| 1 | Invalid / Not Following Instructions | Does not follow Playbook guidelines. Fundamentally flawed, off-scope, or cannot be evaluated meaningfully. Must be sent back for revision. |


4. Logistics Hub

Things to do before you get started:

  1. Go to the Annotations Platform
    • Verify you see Project Proctor (Production) on your dashboard
    • Reach out via Slack or email if you do not see it
  2. Complete the Onboarding steps
    • Ensure your Stripe account is linked
    • Accept the Project Terms
    • Pass the assessment
    • Join the HAI Fellowship Slack Workspace (you will be added to the relevant Project Proctor channels, used for all communication, within 24 hours)
  3. Submit Your First Task
    • Upon submission, your task will either: 1) get approved or 2) be sent back with feedback
    • Note: submission quality and approval rate will be closely monitored
    • Once your first task is approved, you will be able to start tasking freely

Lovable Hub – central repository for all links, training, resources, and documents related to Project Proctor (pw: FrontierFail2)


5. Task Workflow

The 15-step workflow:

  1. Pick your domain + confirm (must be copied/pasted to work with rubric evals)
  2. Write a clear prompt in the chosen domain with one verifiable answer that requires expertise and challenges all models
  3. Ensure either Model A Response 1 or 2 is stumped, confirm the stump is valid
  4. Ensure either Model B Response 1 or 2 is stumped, confirm the stump is valid
  5. Submit only the exact final answer (concise, clear, without explanation or units)
  6. Pass@K automatically evaluates whether the task is acceptable (Model A has at least one reasoning failure; Model B has at least one reasoning failure; at least one response is correct across Pass@K, Model A, and Model B). If conditions not met, the task is unacceptable.
  7. Record how many model responses failed (reminder: at least one Model A and one Model B response must fail)
  8. Provide a clear rationale explaining why Model A was stumped
  9. Provide a clear rationale explaining why Model B was stumped (same as #8 for Model B side)
  10. State your final answer format
  11. Provide the subdomain of the task, education level, and difficulty level
  12. Provide a complete, original step-by-step solution (show all work) in format: Step 1 > Step 2 > ... > Final Answer
  13. Write Rubrics!
  14. If submitting with a Quality Check flag, provide your justification for why the flag is wrong
  15. You're ready to submit! Congratulations!

6. Tasking Instructions

Overview

This project focuses on developing challenging prompts designed to push the limits of state-of-the-art models. For each prompt, you'll also provide a complete solution, a set of progressive hints, and a detailed scoring rubric to improve future responses.

Goal: Your goal is to induce deep reasoning failures in the model, where the model also gives an incorrect final answer.

Your problems should be tough enough to cause substantive reasoning failures in AI models, going beyond simple computational slips or formatting glitches.

Use Control+F (or Command+F on a Mac) to easily navigate these instructions.


Step 0: Select the Domain

Categorize the task correctly to ensure it fits the project scope. Options: Chemistry, Biology, Engineering, Physics, Astronomy.

Important: Know your limits! Only complete tasks within your domain expertise.


Step 1: Design Your Prompt

Write a domain-specific prompt that challenges the model's advanced reasoning capabilities.

  1. Read through the instructions — Create a clear, unambiguous prompt that:

    • Requires domain expertise
    • Stumps each model at least once
    • Has exactly one verifiable answer
  2. Confirm you will adhere to the Fellow guidelines (checkboxes for: not duplicative/LLM-generated, self-contained, requires domain expertise at Graduate/PhD level, answer cannot be guessed)

  3. Enter your prompt

Notes:

  • Platform uses Markdown for text and KaTeX for math — verify rendering
  • Image Policy: STEM images, specifically circuit diagrams, are allowed only for Electrical Engineering prompts. All other images must be self-generated.

Prompt Quality Standards:

Difficulty Requirements:

  • Minimum: High School Olympiad
  • Target level: Graduate and PhD-level
  • Must require deep reasoning, not routine computation
  • Avoid straightforward, textbook-style references

Originality Requirements:

  • Must be 100% original
  • Must not be copied from textbooks, research papers, competitions, or websites
  • Must not appear in online search results
  • May use external papers/references for inspiration, but the problem itself must be novel

Self-Containment:

  • The problem must be unambiguous and fully solvable without external resources
  • Do not assume the model knows any specific paper

Formatting:

  • Must be text-based
  • Cannot be multiple choice (explicit or implicit)
  • Should explicitly note output answer format when possible
  • Images may be included in the prompt only (never in the solution); image attachments are currently on hold

Question Types:

  • Ranking tasks with 4+ items are allowed
  • Binary (yes/no, true/false) and ternary (increase/decrease/no change) questions are not allowed

Additional Rules:

  • Don't exploit the same model weakness more than once per week
  • Templated prompts are not allowed
  • Keep your prompts diverse in structure and concepts
  • Avoid questions requiring niche knowledge that hinges on obscure, low-citation papers
  • When possible, indicate answer format and calculation precision in your prompt — recommend requesting 2–3 significant figures for the final answer
  • For intermediate steps, constrain rounding to 5 decimal places (7 for amplifying functions like log, ln, exponentials)
  • Use Markdown-compatible notation and avoid the standard * multiplication symbol (it conflicts with italics)
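A quick sketch of the last two rules above, with made-up values (KaTeX source as it might appear in a prompt):

```latex
% Risky: Markdown may swallow the bare * as italics markup
The power dissipated is $P = I^2 * R$.

% Preferred: use \cdot (or \times) inside math mode
The power dissipated is $P = I^2 \cdot R$.

% State precision explicitly in the prompt text:
Report the final answer to 3 significant figures, rounding
intermediate results to 5 decimal places.
```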

Step 2: Generate and Review Model A's Responses

Once you submit your prompt, it will immediately kick off Model A's reasoning process.

If the model is taking a while to respond, you may pause the task and work on a different one; the model will continue running in the background.

If Model A does not generate an error in at least one of its two responses, do not proceed to Model B. Instead, refine the prompt and re-test Model A.


Step 3: Generate and Review Model B's Responses

Once you have confirmed a reasoning failure in Model A, generate two responses from Model B.

Do not proceed to this step until you have confirmed that Model A makes at least one reasoning error.


Step 4: Confirm Model Reasoning Errors

Your prompts must be difficult enough that state-of-the-art AI models cannot reliably solve them. You are looking for significant reasoning errors, not superficial mistakes like rounding or arithmetic slips.

You will evaluate four responses: Model A (Response 1, Response 2) and Model B (Response 1, Response 2).

At least one response from each model must contain a critical error that leads to an incorrect answer. If Model A does not generate an error, do not proceed to Model B. Instead, refine the prompt and re-test Model A.

Valid Model Stumps:

  • Fundamental conceptual misunderstandings
  • Incorrect application of theorems or principles
  • Flawed logical reasoning or invalid deductions
  • Missing critical steps

Does Not Count:

  • Rounding errors or arithmetic mistakes
  • Formatting issues
  • Otherwise correct reasoning with small computational slips

Non-Exhaustive List of Reasoning Types:

| Reasoning Type | Definition |
|---|---|
| Deductive reasoning | Drawing specific conclusions from general laws or principles |
| Inductive reasoning | Generalizing from patterns or experimental observations |
| Temporal reasoning | Predicting events or states based on order in time |
| Spatial reasoning | Understanding structures, orientation, or symmetry |
| Causal reasoning | Identifying cause-and-effect relationships |
| Comparative analysis | Judging between alternatives or evaluating differences across conditions |
| Abstract reasoning | Working with non-concrete or theoretical ideas |
| Pattern recognition | Spotting and interpreting regularities in data or sequences |
| Statistical reasoning | Using data, probabilities, and distributions to reach conclusions |
| Abductive reasoning | Inferring the most likely explanation from incomplete evidence |
| Hypothetical reasoning | Predicting outcomes under hypothetical or counterfactual scenarios |


Step 5: Submit the Final Answer

Enter only the exact final answer in the Final Answer field. It must be concise, clear, and unambiguous, with no explanations, labels, commentary, or units unless explicitly required in the prompt. For example, if the prompt asks for a speed to 3 significant figures in m/s, enter 330 rather than "v = 330 m/s".

Upon submission, you will see a Pass@K test begin to run. If Pass@K fails, you can retrigger it by resubmitting the Final Answer block without editing it.


Step 6: Pass@K Evaluation

After submission, Pass@K will automatically run, and you will see Model A and Model B responses appear.

Important note: We love complex prompts, but to ensure your work is immediately valuable for training, your prompt must be solved correctly by at least one of the tested responses (across the two Model A responses, the two Model B responses, and the eight Pass@K responses).

  • If both Model A responses and both Model B responses fail, you must earn at least a 1/8 Pass@ score
  • If at least one of those responses is correct, a 0/8 Pass@ score is acceptable

New Pass@ guidance (3/6/26): Tasks that don't meet these criteria will be automatically returned to the attempter upon submission.
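To make these conditions concrete, here is a minimal sketch of the acceptance logic in Python. This is an assumed illustration of the criteria described above, not the platform's actual implementation.

```python
def task_acceptable(model_a_correct: list[bool],
                    model_b_correct: list[bool],
                    passk_correct: list[bool]) -> bool:
    """Sketch of the acceptance criteria: each argument holds one
    boolean per response, True meaning that response reached the
    correct final answer with sound reasoning."""
    model_a_stumped = not all(model_a_correct)   # Model A fails at least once
    model_b_stumped = not all(model_b_correct)   # Model B fails at least once
    # At least one response anywhere must solve the problem correctly.
    someone_solves = any(model_a_correct + model_b_correct + passk_correct)
    return model_a_stumped and model_b_stumped and someone_solves


# Example: each model fails once, and one Pass@K response succeeds.
print(task_acceptable([True, False], [False, True], [True] + [False] * 7))  # True
```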


Step 7: Confirm Validity of Responses

Certify that Model A and Model B were each stumped at least once AND that at least one response exhibited no reasoning errors and reached the correct final answer.

Select how many of the Model A and Model B responses answered incorrectly. The value must be 2, 3, or 4 (at least one failure from each model is required to continue).


Step 8: Provide Failure Rationale

Provide one detailed failure critique for Model A and another for Model B explaining the specific reasoning errors.

You will see fields indicating which Model A/B failure you are justifying (Response 1 or Response 2). Provide an in-depth justification of how, where, and WHY it failed. The more detailed your justification, the higher the chance your task will be approved.


Step 9: Select Final Answer Format

For metadata purposes. Please specify the format of the final answer.

Options: Integer, Decimal, Fraction, Text (case sensitive), Text (case insensitive), Ordered list, Unordered list


Step 10: Add References

Optional. You may include 1–5 URLs for any source material that inspired your prompt/solution (or leave as N/A).

  • If you have used any reference material (paper, article, textbook, online post, etc.) you must include at least one URL
  • If source material is required to solve your Prompt, provide a justification
  • If the material is self-generated, specify that in the justification

USE THIS TOOL: https://proctor-reference-builder.lovable.app/. Then paste the structured JSON (nothing else) into the field below and submit!

References must be in JSON format with key, target_source, and required_to_solve fields.
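As a purely illustrative sketch (the builder tool above produces the authoritative structure; the URL and field values here are hypothetical placeholders), an entry with the three required fields might look like this, shown via Python:

```python
import json

# Hypothetical reference entry; values are placeholders, and the
# authoritative JSON comes from the reference-builder tool above.
references = [
    {
        "key": 1,
        "target_source": "https://example.com/inspiring-paper",
        "required_to_solve": False,  # inspiration only; the prompt is self-contained
    }
]

print(json.dumps(references, indent=2))
```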


Step 11: Provide Hints

Include exactly 3 progressive hints designed to help a qualified expert solve the problem within 1 hour. A hint = a slight or indirect indication or suggestion.

Three Fields:

  • Hint 1: Provides a productive starting point
  • Hint 2: Advances meaningfully toward the solution
  • Hint 3: Gets close to the final answer

Hints must be written as complete thoughts or sentences!

Note: Each hint should be useful on its own. Do not reveal the complete solution in any single hint. Hint #2 cannot reference Hint #1, and Hint #3 cannot reference Hint #1 or #2.

Each hint must be standalone and readable without the others, even though a solver may realistically need the earlier hints to make full use of the later ones.


Step 12: Fill in Metadata

Provide:

  • Subdomain (e.g., Physics / Quantum Physics)
  • Education level of this task: High-school olympiad, Undergraduate, Graduate
  • Given the education level, difficulty level: Easy, Medium, Hard

Step 13: Provide Step-by-Step Solution

Provide a complete, step-by-step solution. This is the definitive reference solution that will be used to evaluate and judge model responses.

The quality of your written solutions is especially important. Treat this as if you were presenting at a conference or giving a seminar talk; your reviewers will be peers in or close to your subfield, but not necessarily specialists in your exact topic.

Your solutions should explicitly write out every reasoning step, calculation, and assumption. Include references/images in your task as often as possible!

Please submit the following:

  1. Required: Full Step-by-Step Solution + Final Answer

    • Complete and rigorous
    • Include intermediate calculations
    • Include definitions used
    • Include all reasoning steps
    • Include logical justifications for each step

    Goal: Write this field well so that a reviewer can independently verify the result without needing to infer missing steps.

    Principles:

    • If a step relies on a known theorem/identity, briefly state it and show how it applies
    • Do not skip steps with phrases like "it is obvious" unless you still provide the missing reasoning
    • Completeness matters more than brevity (no character limit)
  2. Final Answer

    • Be concise
    • Verifiable
    • Required for numerical or computational problems (e.g., a number, expression, vector, matrix, algorithm output, etc.)
    • Optional for proof-based problems
    • If included, it should clearly state the conclusion (e.g., "Therefore, statement S holds for all n.")
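Putting these requirements together, here is a minimal skeleton of the Step 1 > Step 2 > ... > Final Answer structure. The physics content is an illustrative placeholder, far below the required difficulty:

```latex
\textbf{Step 1:} State the governing relation and why it applies:
for a wave on a string, the speed satisfies $v = f \lambda$.

\textbf{Step 2:} Substitute the given values and compute:
$v = (440 \text{ Hz}) \cdot (0.75 \text{ m}) = 330 \text{ m/s}$.

\textbf{Final Answer:} $330 \text{ m/s}$
```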

In-task Quality Checks: Implemented to catch issues early in the tasking pipeline and found in the side panel on the right. These are meant to aid your tasking, but please do not treat them as a source of truth.


Step 14: Create Rubrics

Create a structured scoring guide, with 2–7 rubric items, to evaluate future model attempts.

As of Mar 3, 2026: a starting-point rubric will be generated based on your step-by-step solution.

You must make edits to this rubric to ensure the LaTeX renders and all logic is correct. A good rubric awards 7/7 to your step-by-step solution and 3 or fewer points to the stumped model responses!

Rubric Makeup:

  1. Criterion = milestone being evaluated
  2. Weight = how many points this criterion is worth
  3. Description = why this criterion matters and how it advances the solution
  4. Grading guidance = precise instructions for awarding full or partial credit, including specific steps/expressions required and how to handle common errors

Rubric Quality Checklist:

  • Rubric: Covers top 7 or fewer major milestones; Defines variables and assumptions clearly; Adheres strictly to prompt instructions
  • Criteria: Are atomic; Are self-contained; Are verb-led; Last one refers to the final answer
  • Weights: Reflect relative importance of rubric items; Add up to 7
  • Grading guidance: Defines tolerance ranges; Awards partial credit wherever applicable

Verb-Led Criteria: Every rubric criterion must begin with a clear action verb. If it does not, rewrite it so it starts with a measurable verb such as: Derives, States, Extracts, Identifies, Explains, Calculates, Cites, Compares

Five Key Rubric Principles:

  1. Verb-Led Criteria — each criterion starts with an action verb
  2. Atomic Criteria — one check per requirement (don't bundle multiple checks into one criterion)
  3. Self-Contained Criteria — each criterion is understandable on its own without needing context from other criteria
  4. Grading Guidance — include clear guidance on what constitutes a pass vs. fail for each criterion
  5. Last Criterion = Final Answer — the final rubric item should check the final answer
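As a purely illustrative sketch of these principles (the criterion text, weight, and tolerances are hypothetical; the golden examples mentioned below are the real references), a single rubric item might look like this in Python:

```python
# Hypothetical rubric item illustrating the five principles above.
rubric_item = {
    # Verb-led, atomic, self-contained criterion.
    "criterion": "Calculates the wave speed from $v = f \\lambda$ using the "
                 "frequency and wavelength given in the prompt",
    "weight": 2,  # weights across all 2-7 items must sum to exactly 7
    "description": "Converting the given quantities into the wave speed "
                   "is the pivotal quantitative step toward the answer.",
    "grading_guidance": "Award 2 points for v = 330 m/s within +/-1%; "
                        "award 1 point if the correct formula is applied "
                        "with a single substitution error; otherwise 0.",
}
```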

There are golden rubric examples available for Biology, Physics, Chemistry, and Engineering in the Rubrics Hub section.


Step 15: Resolve Quality Checks + Submit

You will see Quality Checks in your Task.

These Quality Checks can be WRONG. Do not treat them as a source of truth. It is completely acceptable to stand by your original work and resubmit without changes.

| Stage | Quality Check Present | Description |
|---|---|---|
| Prompt | Plagiarism | Checks to see if your prompt is original |
| | Ambiguous Prompt | Checks if the prompt is asking for something unverifiable |
| | Scientific Reasoning Validation | Checks to ensure scientific reasoning is sound |
| | Single Question | Checks that the prompt is asking for a single thing |
| Step-by-Step Solution | Scientific Reasoning Validation | Checks to ensure scientific reasoning is sound |
| Final Answer | Scientific Reasoning Validation | Checks to ensure scientific reasoning is sound |
| Rubric (QC for Rubric checks are quite strong; please focus on these as much as possible and try to address all callouts) | Atomic criterion | Checks for stacked criteria (multiple requirements in one) |
| | Clear and explicit grading guidance | Checks that grading guidance includes explicit score levels or point allocations linked to checkable conditions; bounded criteria free of subjective terms; stated or derivable expected targets/answers; and concrete rules for partial credit (or any scoring) |
| | Criterion self-contained | Checks to ensure your criteria do not rely on outside information |
| | Criterion begins with an action verb | Checks to ensure your criterion begins with a verb |

Additional quality checks include: Domain confirmation, plagiarism detection, and more.


7. Rubrics Hub

The goal of each rubric is to establish clear, reliable, and consistent criteria that define what an ideal model response should look like.

Rubrics must be:

  • Mutually Exclusive – each criterion should evaluate one distinct element without overlapping with others
  • Collectively Exhaustive – together, the criteria should cover all important aspects of the response, leaving no major gaps in evaluation
  • Robust and Consistent – the rubric should work reliably across a wide range of responses, minimizing ambiguity while keeping each criterion a binary (pass/fail) check

This will enable objective evaluation of varied model outputs in a structured and fair manner.

This will be challenging at first — that is normal! Rubrics require careful reading, thoughtful judgment, and consistent application of criteria across different scenarios. Many fellows find that once they've reviewed a few examples and applied the criteria a handful of times, the process becomes much more intuitive.


8. Domain Specifics

Sub-tabs for domain-specific guidance:

  • Physics/Astronomy
  • Chemistry
  • Biology
  • Engineering

9. LaTeX Guide

LaTeX is strictly enforced in this project: the LaTeX in the prompt and the answer must render correctly and follow the required format.

  • All numerical values and variables should be written inside $ delimiters, and all units must be placed inside \text{} within the $ delimiters so that they are not italicized
  • For instance, the applied force is $F=123 \text{ N}$ (the space puts a small gap between the number and unit). If you instead write $F=123 N$ in the prompt, the task will be sent back immediately.
  • Additionally, EACH NUMBER should be given to at least the precision requested for the answer (which must also be specified)
  • For numerical answers, the prompt must specify the units the answer should be in; for equation answers, all required variables in the answer must be specified in the prompt

Resources:

  • Project Proctor Guide to LaTeX
  • Handshake LaTeX Master Guide

All numerical values, variables, and chemical formulas must be formatted in LaTeX with $ delimiters for inline math and $$ delimiters for display (separate line) math.
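For instance, a short sketch of both delimiter styles (the values are illustrative):

```latex
% Inline math with upright units via \text{}:
The applied force is $F = 123 \text{ N}$ and the solution contains
$c = 0.0250 \text{ mol/L}$ of $\text{NaCl}$.

% Display math on its own line:
$$
E_k = \frac{1}{2} m v^2 = \frac{1}{2} (2.0 \text{ kg}) (3.0 \text{ m/s})^2 = 9.0 \text{ J}
$$
```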


10. Key Principles

For each submission, you will provide a complete solution, a set of progressive hints, and a detailed scoring rubric.

The core goal is difficulty: your problems should be tough enough to cause substantive reasoning failures in state-of-the-art AI models, going beyond simple computational slips or formatting glitches.

If you're having trouble stumping the model in your early prompts, don't be discouraged! We are trying to write PhD+ level questions: state-of-the-art models have become increasingly good at finding and recalling high-level domain knowledge, but by adding reasoning requirements we challenge them to understand and apply that knowledge.

What Success Looks Like

A standout submission:

  • Demands deep conceptual reasoning rather than just surface-level pattern matching
  • Is completely self-contained and clear, leaving no room for ambiguity
  • Successfully stumps frontier models by inducing incorrect reasoning paths
  • Includes a robust rubric that allows for reliable, consistent grading

Non-Goals (What to Avoid)

  • "Gotcha" questions, trick wording, or adversarial prompt hacking
  • Problems that can be solved easily via an internet lookup
  • Stumping models through minor arithmetic errors or ambiguity

General Ideas

Rule of thumb: Would everyone in your field know about this concept? Avoid questions that hinge on unique reaction conditions known only from one or two recently published papers with minimal citations.

Data Diversity: Your tasks should use a diverse range of strategies and question formats to stump the model. Do not repeatedly submit tasks which take on the same structure or involve the same concept (e.g., Products of the same reaction/mechanism with different starting materials, etc.).


11. Pre-Submission Checklist

Before submitting each task, verify the following. Missing or incomplete sections will result in rejection.

Prompt Quality (Step 1)

Clarity & Structure:

  • [ ] The prompt is fully self-contained
  • [ ] All variables, assumptions, and definitions are explicitly stated
  • [ ] The prompt is unambiguous
  • [ ] No hidden assumptions
  • [ ] The problem does NOT require internet access
  • [ ] The problem is NOT multiple choice (explicitly or implicitly)

Difficulty:

  • [ ] At least High School Olympiad level
  • [ ] Targeting Graduate / PhD-level reasoning
  • [ ] Requires multi-step reasoning, not routine computation
  • [ ] Cannot be solved by formula recall alone

Originality:

  • [ ] Problem is 100% original
  • [ ] Not copied from textbooks, papers, competitions, or websites
  • [ ] Not searchable online
  • [ ] Not a trivial modification (variable/name/number changes only)

Formatting:

  • [ ] Mathematical notation renders correctly in Markdown
  • [ ] LaTeX is used properly (preferred)
  • [ ] No Markdown-breaking formatting errors
  • [ ] Units formatted correctly (especially Physics/Chemistry domains)
  • [ ] Images (if used) appear ONLY in the prompt, not in the solution
  • [ ] Any images used are self-generated or open-access and properly cited

Model Failure Validation (Step 4)

Required Failure Condition:

  • [ ] At least one response from Model A fails
  • [ ] At least one response from Model B fails
  • [ ] The failure is a reasoning error, not just arithmetic

Valid Model Failure (at least one of the following):

  • [ ] Conceptual misunderstanding
  • [ ] Incorrect theorem/principle application
  • [ ] Logical flaw
  • [ ] Missing critical reasoning step

Failure Rationale Quality (Step 8)

For each failing model:

  • [ ] Identified the exact reasoning error
  • [ ] Explained why the reasoning is incorrect
  • [ ] Referenced the specific step where the failure occurs
  • [ ] Distinguished reasoning error from calculation error
  • [ ] Explanation is precise and technical (not vague)

Step-by-Step Solution (Step 13)

Completeness:

  • [ ] Every reasoning step is explicitly written
  • [ ] All intermediate calculations shown
  • [ ] All formulas justified
  • [ ] All assumptions stated
  • [ ] No "it is obvious" shortcuts
  • [ ] Reviewer can independently verify the solution

Accuracy:

  • [ ] All algebra verified
  • [ ] All constants checked
  • [ ] Units consistent
  • [ ] Final answer matches derivation

Formatting:

  • [ ] Logical numbering of steps
  • [ ] Final answer clearly labeled
  • [ ] Final answer matches required format (integer, decimal, text, etc.)
  • [ ] Units included or excluded as required by prompt

Final Answer Requirements (Step 5)

  • [ ] Unique and unambiguous
  • [ ] Verifiable
  • [ ] Matches the solution exactly
  • [ ] Correct number of significant figures (if specified)
  • [ ] Correct answer format selected

Hints (Exactly 3 Required) (Step 11)

  • [ ] Exactly 3 hints provided
  • [ ] Hint 1 gives a productive starting point
  • [ ] Hint 2 advances toward core reasoning
  • [ ] Hint 3 approaches final solution
  • [ ] No hint reveals the full solution
  • [ ] Hints are independently useful
  • [ ] Hints do NOT reference each other

Metadata & Taxonomy (Step 12)

  • [ ] Correct Subdomain selected
  • [ ] Education Level appropriate
  • [ ] Difficulty level appropriate
  • [ ] Answer type matches final answer
  • [ ] Subject area chosen correctly

Rubric (2–7 Items, Total = 7 Points) (Step 14)

Structure:

  • [ ] 2–7 criteria
  • [ ] Total weight = exactly 7
  • [ ] Each weight between 1–4
  • [ ] Positive integers only
  • [ ] Partial credit allowed in 0.5 increments

Quality:

  • [ ] Each rubric item is atomic (one check only)
  • [ ] Each rubric item is self-contained
  • [ ] Includes ground-truth values explicitly
  • [ ] Includes tolerance ranges (if applicable)
  • [ ] Covers entire reasoning process
  • [ ] Final answer is one rubric criterion
  • [ ] Does NOT contradict prompt constraints
  • [ ] Would penalize major reasoning errors heavily

Final Self-Reflection Check

Ask yourself:

  • [ ] Would this problem genuinely challenge a frontier model?
  • [ ] Does the reasoning require domain expertise?
  • [ ] Is the model failure substantial?
  • [ ] Could a reviewer grade this without ambiguity?
  • [ ] Is there any ambiguity that could cause rejection?

12. Appendix

See each subtab for details.

Additional Resources:

  • Stumping the Model: A Guide to Reasoning Complexity