Project Watt — Handshake AI Extraction


Source: project-watt-handshake-ai-extraction.md (ingested 2026-03-28)

Below is a complete extraction of all information about Project Watt from both the Handshake AI platform and the Lovable training guide:


PROJECT WATT — Overview

  • Platform: Handshake AI (HAI)
  • Pay Rate: $75/hr
  • Onboarding Start Date: March 17, 2026
  • Current Batch: Batch 3 — Deadline: 8,400 tasks by March 23 EOD PST

Project Description: This project focuses on systematically evaluating model responses across multiple quality dimensions. You'll review several model-generated answers and compare them using structured criteria. Based on these evaluations, assign preference rankings that reflect which responses best satisfy the prompt's intent and demonstrate strong reasoning and communication.

Key Stats: 8 Verticals, 6 Models Evaluated per task, 6 Rating Dimensions

Domains: STEM, Coding, and Professional


ONBOARDING STEPS (Handshake AI)

  1. Confidentiality agreement — (completed)
  2. Project Terms — (completed)
  3. Project assessment — 10 minutes. A short assessment to confirm understanding of the project requirements before getting started. Progress doesn't save if you exit early.
  4. Setup payments — (completed)

Progress Path: (1) Complete onboarding → (2) Start tasking, earn money → (3) Do more tasks, earn more


ASSESSMENT DETAILS

  • Title: Project Watt — Preference Ranking Assessment
  • Format: Open-book, ~30 minutes, 80% pass rate required, one attempt
  • Content: 9 multiple-choice questions + 1 benchmark exercise
  • Training guide password: 5xwatt

Important: Progress is NOT saved between sessions. Complete the assessment in one sitting.

Key Principle: Correctness outweighs polish. A Truthfulness score of 1 matters more than high scores on every other dimension. Overall Quality is a holistic judgment, not an average.


GETTING STARTED (3 Steps)

Step 1 — Join the Platform: Access the HAI platform, accept the invite to your assigned project, accept terms, complete Stripe setup for payments.

Step 2 — Review the Tasking Instructions: Read through Rating Dimensions, Justifications, and Workflow.

Step 3 — Start Tasking: Head to the Tasking Workflow page to learn how to claim and complete tasks.


CORE RULE: Everything Runs Through Your Domain

Your assigned domain is the foundation of all work. Every task you claim, every evaluation you complete, and every justification you write must be within your assigned domain. Do not claim tasks outside your domain under any circumstances. Find your Profile ID on the platform, then look it up in the Domain Finder Sheet.


TASKING WORKFLOW

Claim Policy: You may only hold 1 claimed task at a time and must complete it within 3 hours. Claims exceeding either limit are automatically removed. Claiming tasks outside your domain results in immediate offboarding.

Video Training: "Preference Ranking Walkthrough" — a video demonstrating how to evaluate and rank model responses (3.48K views, ~6 minutes).


TASKING INSTRUCTIONS — Step-by-Step

1. Find & Claim a Task: Navigate to your project dashboard, confirm your domain using the Domain Finder Sheet, use inbox filters to filter by assigned domain, go to available tasks and claim one.

2. Review the Prompt: You'll see a preseeded prompt for that domain/subdomain along with the difficulty level (Easy, Medium, or Hard).

3. Rate Each Response: Evaluate each of the 6 model responses across all 6 dimensions. Provide a 1-2 sentence justification for each rating.

4. Rank the Responses: After rating all 6 model responses, rank them from best to worst based on overall assessment.


RATING DIMENSIONS (A → B → C Process)

A. Review — Read all 5 model responses carefully with the prompt in mind. Before scoring: re-read the prompt (keep requirements fresh), check factual claims (are citations real? are statistics accurate? does the response hallucinate?), spot omissions (did the response miss key parts?), assess structure & tone (is it well-organized and professional?).

B. Rate — Score each response across all 6 dimensions using the 1–3 scale:

General Scale: 1 = Bad, 2 = Mediocre, 3 = Good. Verbosity Scale: Too Long, Too Short, Just Right.

The 6 Dimensions:

1. Instruction Following — Does the response follow the given instructions?

  • 1 (Bad): Ignores key instructions or answers a different question. Misses or misinterprets multiple requirements.
  • 2 (Mediocre): Addresses the general topic but skips or loosely interprets some parts of the prompt.
  • 3 (Good): Every part of the prompt is addressed as asked. Follows all constraints (format, length, scope).

2. Truthfulness — Is the information provided accurate?

  • 1 (Bad): Multiple factual errors, hallucinated details, or fabricated sources that undermine the response.
  • 2 (Mediocre): Mix of accurate and inaccurate information. Some claims are unverifiable or misleading.
  • 3 (Good): All facts, citations, and claims are accurate. No hallucinations or fabricated details.

3. Verbosity — Is the response appropriately concise or verbose?

  • Too Long: Excessively padded with filler, repetitive sections, or unnecessary detail that dilutes the answer.
  • Too Short: Leaves out important information or useful detail that should have been included.
  • Just Right: Length is well-matched to the complexity of the question. Every sentence adds value.

4. Writing Quality — Is the response clear, natural, well-structured, and appropriate in tone?

  • Note: You should not dock the model on LaTeX rendering issues unless you have confirmed in another tool such as Overleaf or QuickLaTeX that the response format is truly incorrect.
  • 1 (Bad): Poorly structured, confusing, or riddled with grammar issues. Hard to extract key points.
  • 2 (Mediocre): Readable but could be better organized. Some awkward phrasing or inconsistent formatting.
  • 3 (Good): Clear and well-organized. Professional tone, logical flow, appropriate use of structure.

5. Correctness — Does the response solve the right problem in the right way for this task?

  • 1 (Bad): Misunderstands the task's goal or applies the wrong approach.
  • 2 (Mediocre): Addresses the right problem but the approach or reasoning has gaps.
  • 3 (Good): Solves exactly what the task requires using an appropriate approach.

6. Overall Quality — What is your general assessment of the overall response quality?

  • Note: This should be a holistic reflection of the response — your score here should be consistent with your other dimension scores.
  • 1 (Bad): Very little usable content. Would not save meaningful time.
  • 2 (Mediocre): Has some useful content but needs significant revision.
  • 3 (Good): You would confidently use this response professionally with little or no editing.

C. Justify (Critical) — Justify your score for each dimension with specific reasoning.


RANKING & JUSTIFICATIONS

This is the most critical step. Your ranking and justification are what the customer values most. Be thoughtful, nuanced, and draw on your domain expertise.

A. Rank — Two ways:

  1. Overall ranking — Rank all 5 model responses on a scale of 1 to 5 (best to worst)
  2. Head-to-head comparison — Rank model responses 2–5 against model response 1

Warning: Head-to-head comparisons must be ranked against Model Response 1 — NOT against your top-ranked choice. Always use Model Response 1 as the baseline.

LaTeX Note: You should not dock the model on LaTeX rendering issues unless you have confirmed in another tool such as Overleaf or QuickLaTeX that the response format is truly incorrect. If you have docked it in your Rating Dimensions, please go back and adjust it.

B. Justify — Write a detailed justification explaining why you ranked the model responses in the order you did:

  • 3–7 sentences per justification — be comprehensive
  • Point out specific details in the response (exact claims, examples, or omissions)
  • Reference concrete strengths and weaknesses rather than vague statements
  • Mention every model response — your justification should address all 5 models

Justification Examples:

Coding (Good): Considers real-world impact and weighs trade-offs contextually (e.g., function naming impacts, debugging steps, speed vs. accuracy).

Law (Good): Compares all three on structure, completeness, and practical utility (e.g., prioritized checklists, risk severity ratings, change-of-control clause coverage).

Medicine (Good): Identifies a patient safety issue across all three responses with clinical specificity (e.g., drug interaction warnings, ACR guidelines compliance, hold duration accuracy).

Finance (Good): Compares analytical depth and methodology rigor across responses (e.g., Sharpe ratio comparisons, volatility adjustments, drawdown analysis).

Bad Examples:

  • "This answered my question better because it was more thorough." → Says nothing specific — what was "more thorough"?
  • "Response A seems wrong about some details, so I went with Response B." → Doesn't identify which details or explain why B is more accurate.
  • "I liked Response B better because the formatting was nicer." → Only addresses surface formatting — no substance comparison.

BATCH II LEARNINGS (Key Takeaways)

1. Rating Contradictions — Your three types of ratings must tell a consistent story

The three evaluation types (dimension scores 1–3, overall ranking 1–6, and pairwise comparisons) are not independent — they must follow a logical flow. Overall Quality score (6th dimension) should inform preference ranking, which should inform pairwise comparisons.

Dimension Ratings (1–3): Rate each response on its own, without comparing it to the others yet. A 3 = excellent, 1 = clearly falls short. Be consistent. The 6th dimension (Overall Quality) should reflect the 5 dimensions before it — it's a summary score, not an independent opinion.

Overall Ranking (1–6): Rankings should be consistent with Overall Quality scores. Use dimension scores as your guide, weighted by what matters for the prompt. Don't rank a response #1 if it scored 1s across multiple dimensions.

Pairwise Comparisons: Every comparison is against Response 1. Recommended approach: write out your 1–6 preference ranked list first, then use it as a reference. If you ranked Response 1 higher than Response 3, prefer Response 1 in that head-to-head. For close calls, lean on the dimensions where they differ most.
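
A minimal sketch of that recommended approach (purely illustrative — the response labels, the example ordering, and the derive_pairwise helper are assumptions, not anything on the platform):

```python
# Hypothetical sketch: derive head-to-head outcomes (vs. Response 1) from a
# finished preference ranking. Labels and structure are illustrative only.

# Preference ranking written out first, best to worst (example ordering).
preference_ranking = ["R3", "R1", "R5", "R2", "R6", "R4"]

def derive_pairwise(ranking, baseline="R1"):
    """Return, for each non-baseline response, which side wins the head-to-head.

    Every comparison is against the baseline (Response 1), never against the
    top-ranked response. A response beats the baseline only if it appears
    earlier (i.e., is ranked higher) in the preference list.
    """
    position = {resp: i for i, resp in enumerate(ranking)}
    return {
        resp: (resp if position[resp] < position[baseline] else baseline)
        for resp in ranking
        if resp != baseline
    }

print(derive_pairwise(preference_ranking))
# {'R3': 'R3', 'R5': 'R1', 'R2': 'R1', 'R6': 'R1', 'R4': 'R1'}
```

Writing the ranked list first and reading the head-to-head winners off it mechanically, as above, is one way to avoid accidentally comparing against your top-ranked response instead of Response 1.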

Quick Consistency Check:

  • Does my Overall Quality score for each response align with the other 5 dimension scores?
  • Does my preference ranking follow my Overall Quality scores?
  • Do my pairwise comparisons match my preference ranking?

2. Referencing Each Model Response in Justifications

Your final justification must mention and compare all 6 responses. A common Batch II mistake was only discussing the top-ranked or bottom-ranked response. Mention each response by name/number and explain its ranking position. Compare responses to each other.

3. Generic Justifications

Justifications must be specific to the prompt and the actual content of each response.

Signs of a generic justification: Uses LLM-generated language like "well-written," "accurate," or "good quality" without specific evidence. Could be copy-pasted to a completely different task and still make sense. Doesn't reference specific content, facts, or details from the response.

What a strong justification looks like: References specific facts, examples, or sections from the response. Explains why something is good or bad in the context of the prompt. Draws on your domain expertise to evaluate correctness and depth.


FELLOW EXPECTATIONS

Quality Standards: Performance is continuously monitored. If your average quality score falls to 3/5 or below on your last 3 tasks, you must complete remediation (additional training) before continuing, and you may be offboarded if quality doesn't improve.

Zero Tolerance Policies (Immediate Offboarding):

  • LLM Usage: Using AI tools to generate justifications or evaluation content is strictly prohibited.
  • Tasks Outside Your Domain: You must only claim tasks matching your assigned domain and expertise.
  • Non-Community Conduct: Violations of HAI community guidelines result in immediate removal.

Summary: Maintain average quality above 3/5 on recent tasks; never use LLM tools; follow all HAI community guidelines; reach out on Slack if struggling.


FAQ

  • How many tasks should I aim for? As many as you can complete with quality during the timeline. Check your specific project track for goals.
  • Where do I ask questions? The #project-watt Slack channels.
  • What if I'm stuck on a task? Post in Slack or bring it to a live support session.
  • Can I work on multiple verticals? Each fellow is assigned to one vertical. Contact your project lead to discuss cross-vertical work.
  • How do I get paid? Pay is hourly, sent weekly on Wednesdays after 6pm PT for the previous Mon–Sun period.
  • What's the time cap per task? 1 hour 50 minutes per task including revisions. Don't leave the timer running while idle.

REFERRAL PROGRAM

Refer and earn up to $300 by inviting others to the platform.

Project Watt — Complete Refined Overview


1. What Is Project Watt?

Project Watt is a paid fellowship on the Handshake AI (HAI) platform where domain experts evaluate and rank AI model responses. You are given a prompt in your domain (STEM, Coding, or Professional), read the model-generated answers, score each one across 6 quality dimensions, rank them from best to worst, and write justifications explaining your decisions. The project spans 8 verticals, and each fellow is assigned to exactly one vertical.

Pay: $75/hr, paid weekly on Wednesdays after 6 pm PT for the previous Mon–Sun period. Current batch: Batch 3 — 8,400 tasks by March 23 EOD PST.


2. Onboarding Pathway

Step 1 — Join the Platform: Accept the project invite on HAI → Accept project terms → Sign the confidentiality agreement → Complete Stripe setup for payments.

Step 2 — Review Training: Read the Tasking Instructions on the Lovable training guide (password: 5xwatt). Covers Rating Dimensions, Justifications, and Workflow. Watch the Preference Ranking Walkthrough video.

Step 3 — Pass the Assessment: 9 multiple-choice questions + 1 benchmark exercise. Open-book, ~30 minutes, 80% pass rate required. One attempt. Progress does not save between sessions — complete in one sitting.

Step 4 — Start Tasking: Confirm your domain via the Domain Finder Sheet, filter tasks by your domain, and begin claiming.


3. Claim Policy & Time Rules

You may hold only 1 claimed task at a time and must complete it within 3 hours. Claims exceeding either limit are automatically removed. The time cap per task is 1 hour 50 minutes including revisions — do not leave the timer running while idle. Claiming tasks outside your domain results in immediate offboarding.
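
For completeness, a trivial sketch of these limits expressed as code (the constants simply restate the policy above; the function names are hypothetical, not a platform API):

```python
# Illustrative only: the constants restate the claim and time-cap rules above.
from datetime import datetime, timedelta

CLAIM_WINDOW = timedelta(hours=3)               # claim is auto-removed after this
TASK_TIME_CAP = timedelta(hours=1, minutes=50)  # max time per task, incl. revisions

def claim_still_valid(claimed_at: datetime, now: datetime) -> bool:
    """A claim lapses once the 3-hour window has passed."""
    return now - claimed_at <= CLAIM_WINDOW

def within_time_cap(timer_elapsed: timedelta) -> bool:
    """Active working time (don't leave the timer running idle) must stay under the cap."""
    return timer_elapsed <= TASK_TIME_CAP
```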


4. Task Workflow (What You Do on Each Task)

Step 1 — Find & Claim: Navigate to your dashboard, filter by your assigned domain, claim one available task.

Step 2 — Review the Prompt: Each task shows a preseeded prompt for your domain/subdomain with a difficulty level (Easy, Medium, or Hard).

Step 3 — Rate Each Response: You receive 5 model responses (the training guide refers to both 5 and 6 in different sections). Evaluate each across all 6 dimensions. Write a 1–2 sentence justification per dimension per response.

Step 4 — Rank the Responses: Rank all responses from best to worst based on your overall assessment.


5. The Three-Phase Evaluation Process

Phase A — Review

Before scoring, read every response thoroughly with the prompt in mind:

  • Re-read the prompt — keep requirements fresh as you evaluate.
  • Check factual claims — are citations real? Are statistics accurate? Does the response hallucinate?
  • Spot omissions — did the response miss key parts of the prompt?
  • Assess structure & tone — is the response well-organized and professional?

Phase B — Rate

Score each response on all 6 dimensions using the 1–3 scale.

General scale: 1 = Bad, 2 = Mediocre, 3 = Good. Verbosity uses a different scale: Too Long, Too Short, Just Right.

The 6 Dimensions:

① Instruction Following — Does the response follow the given instructions?

  • 1 (Bad): Ignores key instructions or answers a different question. Misses or misinterprets multiple requirements.
  • 2 (Mediocre): Addresses the general topic but skips or loosely interprets some parts of the prompt.
  • 3 (Good): Every part of the prompt is addressed as asked. Follows all constraints (format, length, scope).

② Truthfulness — Is the information provided accurate?

  • 1 (Bad): Multiple factual errors, hallucinated details, or fabricated sources that undermine the response.
  • 2 (Mediocre): Mix of accurate and inaccurate information. Some claims are unverifiable or misleading.
  • 3 (Good): All facts, citations, and claims are accurate. No hallucinations or fabricated details.

③ Verbosity — Is the response appropriately concise or verbose?

  • Too Long: Excessively padded with filler, repetitive sections, or unnecessary detail that dilutes the answer.
  • Too Short: Leaves out important information or useful detail that should have been included.
  • Just Right: Length is well-matched to the complexity of the question. Every sentence adds value.

④ Writing Quality — Is the response clear, natural, well-structured, and appropriate in tone?

  • 1 (Bad): Poorly structured, confusing, or riddled with grammar issues. Hard to extract key points.
  • 2 (Mediocre): Readable but could be better organized. Some awkward phrasing or inconsistent formatting.
  • 3 (Good): Clear and well-organized. Professional tone, logical flow, appropriate use of structure.
  • Note: Do not dock for LaTeX rendering issues unless confirmed incorrect in Overleaf or QuickLaTeX.

⑤ Correctness — Does the response solve the right problem in the right way for this task?

  • 1 (Bad): Misunderstands the task's goal or applies the wrong approach.
  • 2 (Mediocre): Addresses the right problem but the approach or reasoning has gaps.
  • 3 (Good): Solves exactly what the task requires using an appropriate approach.

⑥ Overall Quality — What is your general assessment of the overall response quality?

  • 1 (Bad): Very little usable content. Would not save meaningful time.
  • 2 (Mediocre): Has some useful content but needs significant revision.
  • 3 (Good): You would confidently use this response professionally with little or no editing.
  • Note: This is a holistic reflection — it should be consistent with your other 5 dimension scores. It is a summary score, not an independent opinion and not a mathematical average.

Phase C — Justify (Critical)

Write a 1–2 sentence justification for each dimension score with specific reasoning drawn from the actual response content.
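
A minimal sketch of how one response's Phase B scores and Phase C justifications could be kept organized while you work (the ResponseEvaluation class and its field names are hypothetical, not a platform schema):

```python
# Hypothetical record of one response's dimension ratings plus per-dimension
# justifications. Field names are illustrative, not a platform schema.
from dataclasses import dataclass, field

GENERAL_SCALE = {1, 2, 3}  # 1 = Bad, 2 = Mediocre, 3 = Good
VERBOSITY_SCALE = {"Too Long", "Too Short", "Just Right"}

@dataclass
class ResponseEvaluation:
    response_id: str
    instruction_following: int
    truthfulness: int
    verbosity: str
    writing_quality: int
    correctness: int
    overall_quality: int
    justifications: dict = field(default_factory=dict)  # dimension -> 1-2 sentence reason

    def __post_init__(self):
        for name in ("instruction_following", "truthfulness", "writing_quality",
                     "correctness", "overall_quality"):
            if getattr(self, name) not in GENERAL_SCALE:
                raise ValueError(f"{name} must be 1, 2, or 3")
        if self.verbosity not in VERBOSITY_SCALE:
            raise ValueError("verbosity must be Too Long, Too Short, or Just Right")

# Example (invented content): a truthfulness failure pulls Overall Quality down.
example = ResponseEvaluation(
    "R2", instruction_following=3, truthfulness=1, verbosity="Just Right",
    writing_quality=3, correctness=2, overall_quality=1,
    justifications={"truthfulness": "Cites a source that does not appear to exist."},
)
```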


6. Ranking & Final Justification

This is the most critical step. Your ranking and justification are what the customer values most.

A. Rank (Two Methods):

  1. Overall ranking — Rank all model responses from 1 (best) to last (worst).
  2. Head-to-head comparison — Compare each response (2, 3, 4, 5…) against Response 1 only. Always use Response 1 as the baseline — never compare against your top-ranked choice.

B. Final Justification:

  • 3–7 sentences — be comprehensive.
  • Point out specific details (exact claims, examples, or omissions).
  • Reference concrete strengths and weaknesses, not vague statements.
  • Mention every model response — your justification must address all of them and explain why each earned its position.

7. The Logical Flow — Consistency Across All Three Rating Types

Your three types of ratings (dimension scores, overall ranking, and pairwise comparisons) must tell a consistent story:

Dimension Scores (1–3) → feed into → Overall Quality (6th dimension) → informs → Preference Ranking (1st to last) → dictates → Pairwise Comparisons (each vs. Response 1)

Consistency rules:

  • If Response A scored Overall Quality = 3 and Response B scored 1, then A must rank higher than B.
  • If you ranked Response 1 at #4 and Response 3 at #6, then in the R1 vs R3 pairwise, you must prefer Response 1.
  • Don't rank a response #1 if it scored 1s across multiple dimensions.
  • For ties in Overall Quality, break them using the other 5 dimensions, weighting what matters most for the specific prompt.

Quick self-check before submitting:

  • Does my Overall Quality score for each response align with the other 5 dimension scores?
  • Does my preference ranking follow my Overall Quality scores?
  • Do my pairwise comparisons match my preference ranking?
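
For anyone who prefers to automate that self-check, here is a minimal sketch (the dict and list shapes and the check_consistency helper are illustrative assumptions, not a platform API):

```python
# Hypothetical consistency check mirroring the self-check above.

def check_consistency(overall_quality, ranking, pairwise, baseline="R1"):
    """Return a list of human-readable inconsistencies (empty list = consistent).

    overall_quality: {"R1": 3, "R2": 1, ...}   Overall Quality scores (1-3)
    ranking:         ["R3", "R1", ...]          preference order, best to worst
    pairwise:        {"R2": "R1", "R3": "R3"}   winner of each head-to-head vs. baseline
    """
    issues = []
    position = {r: i for i, r in enumerate(ranking)}

    # 1. The preference ranking should follow the Overall Quality scores.
    for a in ranking:
        for b in ranking:
            if overall_quality[a] > overall_quality[b] and position[a] > position[b]:
                issues.append(f"{a} has higher Overall Quality than {b} but is ranked below it")

    # 2. The pairwise comparisons should match the preference ranking.
    for resp, winner in pairwise.items():
        expected = resp if position[resp] < position[baseline] else baseline
        if winner != expected:
            issues.append(f"Head-to-head {resp} vs {baseline}: expected {expected} to win")

    return issues
```

An empty result corresponds to answering "yes" to all three questions above.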


8. Key Principles

Correctness outweighs polish. A Truthfulness score of 1 matters more than high scores on every other dimension. A factually wrong response can never be ranked above a factually sound one just because it reads better.

Overall Quality is holistic, not computed. It's a "would I use this as-is?" judgment informed by all dimensions, not an average or formula.

Do not dock for LaTeX rendering. Unless you've confirmed in Overleaf or QuickLaTeX that the format is truly incorrect, rendering artifacts are not the model's fault.

Domain expertise is your core value. Draw on it in every justification. Generic language that could apply to any task will be flagged.


9. Common Pitfalls to Avoid

Rating contradictions — Ranking a response #1 while giving it low dimension scores, or preferring Response 3 over Response 1 in pairwise when your ranking has Response 1 higher.

Generic justifications — Phrases like "well-written," "more thorough," or "good quality" without citing specific content from the response. If your justification could be copy-pasted to a different task and still make sense, it's too vague.

Incomplete justifications — Only discussing the top and bottom responses. Every model response must be mentioned by name/number with its ranking position explained.

LLM-generated language — Using AI tools to write justifications is strictly prohibited and results in immediate offboarding.


10. Fellow Expectations & Policies

Quality threshold: Average quality score must stay above 3/5 on your last 3 tasks. Falling below triggers required remediation training, and continued low performance leads to offboarding.

Zero tolerance (immediate offboarding, no remediation):

  • Using AI/LLM tools to generate any part of your work.
  • Claiming tasks outside your assigned domain.
  • Violating HAI community guidelines.

Support: Use the #project-watt Slack channels for questions, or attend live support sessions. The team would rather help than offboard.


11. FAQ Quick Reference

  • Task volume goal: As many as you can complete with quality during the timeline.
  • Questions: Post in #project-watt Slack.
  • Stuck on a task: Post in Slack or join a live support session.
  • Multiple verticals: One vertical per fellow; contact your project lead to discuss changes.
  • Payment: Hourly, weekly on Wednesdays after 6 pm PT (previous Mon–Sun).
  • Time cap: 1 hour 50 minutes per task including revisions.