Project Ohm Handshake AI Overview
Project Ohm — Complete Refined Overview
1. What Is Project Ohm?
Project Ohm is a paid fellowship on the Handshake AI (HAI) platform where domain experts evaluate and rank AI model responses. On each task you receive a prompt in your domain (STEM, Coding, or Professional), read 2 model-generated answers, score each across 6 quality dimensions, compare them head-to-head, and write a justification explaining your preference. The project spans 8 verticals, and each fellow is assigned to exactly one vertical.
Pay: $75/hr, paid weekly on Wednesdays after 6 pm PT for the previous Mon–Sun period.
Onboarding Start Date: March 24, 2026
Instructions Site: https://project-ohm.lovable.app
2. Key Differences from Project Watt
| Feature | Project Watt | Project Ohm |
|---|---|---|
| Models per task | 5–6 | 2 |
| Ranking method | Overall ranking 1–5/6 + pairwise vs. Response 1 | Head-to-head comparison only (prefer Response 1 or Response 2) |
| Overall ranking | 1st through 6th | 1st through 2nd |
| Justification scope | Must reference all 5–6 models | Must reference both models |
| Slack channel | #project-watt | #project-ohm |
| Extra tools | Domain Finder Sheet | Domain Finder Sheet + Claim Sheet |
| Assessment | Via Handshake AI | Via Handshake AI |
Everything else — the 6 rating dimensions, the 1–3 scoring scale, the claim policy, time caps, fellow expectations, and zero-tolerance policies — is identical.
3. Onboarding Pathway
Step 1 — Join the Platform: Accept the project invite on HAI → Accept project terms → Complete Stripe setup for payments.
Step 2 — Review Training: Read the Tasking Instructions on the Project Ohm Lovable guide (https://project-ohm.lovable.app).
Step 3 — Pass the Assessment: Complete the project assessment on the HAI platform (10 min).
Step 4 — Start Tasking: Confirm your domain via the Domain Finder Sheet, use the Claim Sheet tool to find tasks, and begin.
Requirements (from the instructions dialog):
- You may only claim one task at a time.
- Tasks held for more than 3 hours will be removed.
- Claiming tasks outside your domain of expertise may result in removal from the project.
4. Claim Policy & Time Rules
You may hold only 1 claimed task at a time, and each claim must be completed within 3 hours; claims held past that window are automatically removed. Separately, the active-work cap per task is 1 hour 50 minutes including revisions — do not leave the timer running while idle. Claiming tasks outside your domain results in immediate offboarding.
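To make these two separate limits concrete, here is a minimal self-tracking sketch; the constant names and the `claim_status` helper are invented for illustration and are not part of any HAI tooling.

```python
from datetime import datetime, timedelta

# Illustrative constants mirroring the two limits above (hypothetical
# helper, not HAI tooling).
CLAIM_WINDOW = timedelta(hours=3)            # claim removed after 3 hours
ACTIVE_CAP = timedelta(hours=1, minutes=50)  # active-work cap, incl. revisions

def claim_status(claimed_at: datetime, active_time: timedelta, now: datetime) -> str:
    """Rough status check for a single claimed task."""
    if now - claimed_at > CLAIM_WINDOW:
        return "removed: claim held longer than 3 hours"
    if active_time > ACTIVE_CAP:
        return "over cap: active time exceeds 1h50m"
    return "ok"

# Example: claimed 2 hours ago with 1 hour of active work logged.
now = datetime.now()
print(claim_status(now - timedelta(hours=2), timedelta(hours=1), now))  # "ok"
```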
5. Task Workflow (What You Do on Each Task)
Step 1 — Find & Claim: Navigate to your project dashboard, confirm your domain using the Domain Finder Sheet, use the inbox filters to narrow to your assigned domain, and claim one available task. You can also use the Claim Sheet tool on the Ohm site (enter your User ID to view available tasks — wait 20 minutes after completing the assessment before using it).
Step 2 — Review the Prompt: Each task shows a preseeded prompt for your domain/subdomain with a difficulty level (Easy, Medium, or Hard).
Step 3 — Rate Each Response: You receive 2 model responses. Evaluate both across all 6 dimensions. Write a 1–2 sentence justification per dimension per response.
Step 4 — Rank the Responses: After rating the 2 model responses, rate your preference for Response 1 or Response 2.
6. The Three-Phase Evaluation Process
Phase A — Review
Read both model responses carefully with the prompt in mind:
- Re-read the prompt — keep requirements fresh as you evaluate.
- Check factual claims — are citations real? Are statistics accurate? Does the response hallucinate?
- Spot omissions — did the response miss key parts of the prompt?
- Assess structure & tone — is the response well-organized and professional?
Phase B — Rate
Score each response on all 6 dimensions using the 1–3 scale.
General scale: 1 = Bad, 2 = Mediocre, 3 = Good. Verbosity uses a different scale: Too Long, Too Short, Just Right.
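As an informal way to picture the two scale types, a sketch like this works (the enum names are invented here, not platform terminology):

```python
from enum import Enum

# Hypothetical encodings of the two scale types: five dimensions use the
# 1-3 scale; Verbosity uses a categorical scale.
class Score(Enum):
    BAD = 1
    MEDIOCRE = 2
    GOOD = 3

class Verbosity(Enum):
    TOO_LONG = "Too Long"
    TOO_SHORT = "Too Short"
    JUST_RIGHT = "Just Right"

print(Score.GOOD.value, Verbosity.JUST_RIGHT.value)  # 3 Just Right
```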
The 6 Dimensions (identical to Project Watt):
① Instruction Following — Does the response follow the given instructions?
- 1 (Bad): Ignores key instructions or answers a different question. Misses or misinterprets multiple requirements.
- 2 (Mediocre): Addresses the general topic but skips or loosely interprets some parts of the prompt.
- 3 (Good): Every part of the prompt is addressed as asked. Follows all constraints (format, length, scope).
② Truthfulness — Is the information provided accurate?
- 1 (Bad): Multiple factual errors, hallucinated details, or fabricated sources that undermine the response.
- 2 (Mediocre): Mix of accurate and inaccurate information. Some claims are unverifiable or misleading.
- 3 (Good): All facts, citations, and claims are accurate. No hallucinations or fabricated details.
③ Verbosity — Is the response appropriately concise or verbose?
- Too Long: Excessively padded with filler, repetitive sections, or unnecessary detail that dilutes the answer.
- Too Short: Leaves out important information or useful detail that should have been included.
- Just Right: Length is well-matched to the complexity of the question. Every sentence adds value.
④ Writing Quality — Is the response clear, natural, well-structured, and appropriate in tone?
- 1 (Bad): Poorly structured, confusing, or riddled with grammar issues. Hard to extract key points.
- 2 (Mediocre): Readable but could be better organized. Some awkward phrasing or inconsistent formatting.
- 3 (Good): Clear and well-organized. Professional tone, logical flow, appropriate use of structure.
- Note: Do not dock for LaTeX rendering issues unless confirmed incorrect in Overleaf or QuickLaTeX.
⑤ Correctness — Does the response solve the right problem in the right way for this task?
- 1 (Bad): Misunderstands the task's goal or applies the wrong approach.
- 2 (Mediocre): Addresses the right problem but the approach or reasoning has gaps.
- 3 (Good): Solves exactly what the task requires using an appropriate approach.
⑥ Overall Quality — What is your general assessment of the overall response quality?
- 1 (Bad): Very little usable content. Would not save meaningful time.
- 2 (Mediocre): Has some useful content but needs significant revision.
- 3 (Good): You would confidently use this response professionally with little or no editing.
- Note: This is a holistic reflection — it should be consistent with your other 5 dimension scores. It is a summary score, not an independent opinion and not a mathematical average.
Phase C — Justify (Critical)
Write a 1–2 sentence justification for each dimension score with specific reasoning drawn from the actual response content.
7. Ranking & Final Justification
This is the most critical step. Your ranking and justification are what the customer values most.
A. Rank — Head-to-Head Comparison:
Compare the 2 model responses against each other and rate your preference for Response 1 or Response 2.
LaTeX Note: Do not dock for LaTeX rendering issues unless confirmed in Overleaf or QuickLaTeX. If you have docked it in your Rating Dimensions, go back and adjust.
B. Final Justification:
Write a detailed justification explaining why you ranked the model responses in the order you did:
- 3–7 sentences per justification — be comprehensive.
- Point out specific details (exact claims, examples, or omissions).
- Reference concrete strengths and weaknesses, not vague statements.
- Mention every model response — your justification must address both models.
8. The Logical Flow — Consistency Across All Three Rating Types
Your three types of ratings (dimension scores, overall ranking, and pairwise comparison) must tell a consistent story:
Dimension Scores (1–3) feed into Overall Quality (the 6th dimension), which informs your Preference Ranking (1st to 2nd), which in turn dictates your Pairwise Comparison.
Consistency rules:
- Rate each response in isolation first — don't compare to the other yet.
- Overall Quality should be reflective of the 5 dimensions before it — it's a summary score, not an independent opinion.
- Rankings should be consistent with your Overall Quality scores — if Response 1 got a 3 and Response 2 got a 1, Response 1 should rank higher.
- Your pairwise choice must be consistent with your overall ranking.
- For close calls, lean on the dimensions where they differ most.
Quick self-check before submitting:
- Does my Overall Quality score for each response align with the other 5 dimension scores?
- Does my preference ranking follow my Overall Quality scores?
- Does my pairwise comparison match my preference ranking?
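For those who find code clearer than prose, a minimal self-contained sketch of this self-check could look like the following; the record layout and function name are invented for illustration, and Verbosity is omitted from the numeric check because it uses a categorical scale.

```python
# Hypothetical record: four numeric dimension scores (1-3; Verbosity is
# categorical and omitted here), an Overall Quality score, and a pairwise pick.
def consistency_issues(r1: dict, r2: dict, preferred: str) -> list[str]:
    issues = []
    for name, resp in (("Response 1", r1), ("Response 2", r2)):
        dims = resp["dimensions"]
        # Overall Quality is a summary score: it should fall within the
        # range of the other dimension scores, not outside it.
        if not (min(dims.values()) <= resp["overall"] <= max(dims.values())):
            issues.append(f"{name}: Overall Quality {resp['overall']} sits outside its dimension scores")
    # The pairwise choice should follow the Overall Quality scores.
    if r1["overall"] != r2["overall"]:
        expected = "Response 1" if r1["overall"] > r2["overall"] else "Response 2"
        if preferred != expected:
            issues.append(f"pairwise pick '{preferred}' contradicts the Overall Quality scores")
    return issues

r1 = {"dimensions": {"instruction": 3, "truthfulness": 3, "writing": 2, "correctness": 3}, "overall": 3}
r2 = {"dimensions": {"instruction": 2, "truthfulness": 1, "writing": 2, "correctness": 2}, "overall": 2}
print(consistency_issues(r1, r2, "Response 2"))  # flags the contradictory pick
```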
9. Key Principles
Correctness outweighs polish. A Truthfulness score of 1 matters more than high scores on every other dimension.
Overall Quality is holistic, not computed. It's a "would I use this as-is?" judgment, not an average.
Do not dock for LaTeX rendering unless confirmed incorrect in Overleaf or QuickLaTeX.
Domain expertise is your core value. Draw on it in every justification.
10. Common Pitfalls (Batch II Learnings)
Rating contradictions — Your pairwise preference contradicting your dimension scores or overall ranking.
Generic justifications — Phrases like "well-written," "more thorough," or "good quality" without citing specific content. If your justification could apply to a different task, it's too vague.
Incomplete justifications — Failing to mention and compare both responses. Your justification must reference each response by name and explain their positions.
11. Fellow Expectations & Policies
Quality threshold: Your average quality score must stay above 3/5 on your last 3 tasks. Dropping to 3/5 or below triggers remediation; continued low performance leads to offboarding.
Zero tolerance (immediate offboarding):
- Using AI/LLM tools to generate any part of your work.
- Claiming tasks outside your assigned domain.
- Violating HAI community guidelines.
Support: Use #project-ohm Slack channels or attend live support sessions.
12. FAQ Quick Reference
- Task volume goal: As many as you can complete with quality during the timeline.
- Questions: Post in #project-ohm Slack.
- Stuck on a task: Post in Slack or join a live support session.
- Multiple verticals: One vertical per fellow; contact your project lead.
- Payment: Hourly, weekly on Wednesdays after 6 pm PT (previous Mon–Sun).
- Time cap: 1 hour 50 minutes per task including revisions.
13. Claim Sheet Tool (Ohm-Specific)
Project Ohm includes a Claim Sheet tool on the training site. Enter your unique User ID (Fellow ID) to view tasks routed to you. If you just completed the assessment, wait 20 minutes before using it — the matching algorithm needs time to process your information.
Project Ohm — Comprehensive Documentation
1. Project Overview
Project Ohm is a paid AI training fellowship hosted on the Handshake AI (HAI) platform. Fellows serve as domain experts who evaluate and compare AI model responses to professional-grade prompts. The core task is preference ranking: you read a prompt, review 2 model-generated answers, score each across 6 quality dimensions, choose which response you prefer, and write a detailed justification explaining why.
Platform: Handshake AI (HAI)
Instructions Site: https://project-ohm.lovable.app
Pay Rate: $75/hr
Onboarding Start Date: March 24, 2026
Work Type: Remote, Contract
Domains Covered: STEM, Coding, and Professional
Verticals: 8 total (each fellow is assigned to exactly one)
Models Evaluated Per Task: 2
Rating Dimensions: 6
Slack Channel: #project-ohm
2. Onboarding Pathway
The onboarding process follows three steps, after which you can begin earning.
Step 1 — Join the Platform. Access the HAI platform via the provided link. Accept the invite to Project Ohm. Accept the project terms (this document outlines the terms of the project and supplements your Contractor Agreement). Complete the Stripe setup to receive payments.
Step 2 — Review the Tasking Instructions. Read through all material on the Project Ohm training site at https://project-ohm.lovable.app. This covers Rating Dimensions, Justifications, and Workflow. Before getting started you must carefully review the project instructions. The platform will present an instructions dialog with three requirements you must acknowledge: (1) you may only claim one task at a time, (2) tasks held for more than 3 hours will be removed, and (3) claiming tasks outside your domain of expertise may result in removal from the project.
Step 3 — Start Tasking. Once you've joined the platform and reviewed the instructions, head to the Tasking Workflow page to learn how to claim and complete tasks. You can use the Claim Sheet tool on the Ohm site to view available tasks by entering your User ID.
Earning Path: (1) Complete onboarding → (2) Start tasking, earn money → (3) Do more tasks, earn more.
3. The Domain Rule
Your assigned domain is the foundation of every piece of work you do on this project. Every task you claim, every evaluation you complete, and every justification you write must be within your assigned domain. Your domain determines which tasks are available to you and ensures evaluations are completed by qualified experts. Do not claim tasks outside your domain under any circumstances — doing so results in immediate offboarding.
To find your domain: locate your Profile ID on the platform, then look it up in the Domain Finder Sheet (a linked Google Spreadsheet).
4. Claim Policy and Time Rules
You may hold only 1 claimed task at a time and must complete it within 3 hours. Claims that exceed either limit will be automatically removed. The time cap per task is 1 hour 50 minutes including revisions. Do not leave the timer running while idle.
Claiming tasks outside your domain results in immediate removal from the project with no remediation.
5. Task Workflow — What You Do on Each Task
Each task follows a four-step process.
Step 1 — Find and Claim a Task. Once you have completed onboarding, navigate to your project dashboard. Confirm your domain using the Domain Finder Sheet, then use the inbox filters to filter by your assigned domain. Go to available tasks and claim one. Alternatively, use the Claim Sheet tool on the Ohm training site by entering your User ID.
Step 2 — Review the Prompt. You will see a preseeded prompt for that domain and subdomain along with the difficulty level (Easy, Medium, or Hard).
Step 3 — Rate Each Response. Carefully review how each model responded to the prompt. Evaluate both responses across all 6 rating dimensions, then provide a concise 1–2 sentence justification for each score.
Step 4 — Rank the Responses. After rating the 2 model responses, rate your preference for Response 1 or Response 2.
6. The Three-Phase Evaluation Process
Every task follows the A → B → C evaluation pipeline.
Phase A — Review
Read both model responses carefully with the prompt in mind; read each one thoroughly before scoring anything.
Re-read the prompt: Keep the prompt's requirements fresh in your mind as you evaluate.
Check factual claims: Are citations real? Are statistics accurate? Does the response hallucinate?
Spot omissions: Did the response miss key parts of the prompt?
Assess structure and tone: Is the response well-organized and professional?
Phase B — Rate
Score each response across all 6 dimensions using the 1–3 scale. The general scale is 1 = Bad, 2 = Mediocre, 3 = Good. Verbosity uses a different scale: Too Long, Too Short, Just Right.
Dimension 1 — Instruction Following. Does the response follow the given instructions? A score of 1 (Bad) means the response ignores key instructions or answers a different question, and misses or misinterprets multiple requirements. A score of 2 (Mediocre) means the response addresses the general topic but skips or loosely interprets some parts of the prompt. A score of 3 (Good) means every part of the prompt is addressed as asked, and the response follows all constraints including format, length, and scope.
Dimension 2 — Truthfulness. Is the information provided accurate? A score of 1 (Bad) means there are multiple factual errors, hallucinated details, or fabricated sources that undermine the response. A score of 2 (Mediocre) means a mix of accurate and inaccurate information where some claims are unverifiable or misleading. A score of 3 (Good) means all facts, citations, and claims are accurate with no hallucinations or fabricated details.
Dimension 3 — Verbosity. Is the response appropriately concise or verbose? "Too Long" means the response is excessively padded with filler, repetitive sections, or unnecessary detail that dilutes the answer. "Too Short" means the response leaves out important information or useful detail that should have been included. "Just Right" means the length is well-matched to the complexity of the question and every sentence adds value.
Dimension 4 — Writing Quality. Is the response clear, natural, well-structured, and appropriate in tone? You should not dock the model on LaTeX rendering issues unless you have confirmed in another tool such as Overleaf or QuickLaTeX that the response format is truly incorrect. A score of 1 (Bad) means the response is poorly structured, confusing, or riddled with grammar issues, making it hard to extract key points. A score of 2 (Mediocre) means the response is readable but could be better organized, with some awkward phrasing or inconsistent formatting. A score of 3 (Good) means the response is clear and well-organized with a professional tone, logical flow, and appropriate use of structure.
Dimension 5 — Correctness. Does the response solve the right problem in the right way for this task? A score of 1 (Bad) means the response misunderstands the task's goal or applies the wrong approach. A score of 2 (Mediocre) means the response addresses the right problem but the approach or reasoning has gaps. A score of 3 (Good) means the response solves exactly what the task requires using an appropriate approach.
Dimension 6 — Overall Quality. What is your general assessment of the overall response quality? This should be a holistic reflection of the response — your score here should be consistent with your other dimension scores. It is a summary score, not an independent opinion, and not a mathematical average. A score of 1 (Bad) means very little usable content that would not save meaningful time. A score of 2 (Mediocre) means the response has some useful content but needs significant revision. A score of 3 (Good) means you would confidently use this response professionally with little or no editing.
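As a rough picture of what Phase B produces for a single response, a record like this (field names invented for the sketch; the platform's own form may differ) bundles the six ratings with their short justifications:

```python
from dataclasses import dataclass, field

# Hypothetical bundle of one response's Phase B output (field names are
# invented for this sketch, not taken from the platform).
@dataclass
class ResponseRating:
    instruction_following: int   # 1-3
    truthfulness: int            # 1-3
    verbosity: str               # "Too Long" / "Too Short" / "Just Right"
    writing_quality: int         # 1-3
    correctness: int             # 1-3
    overall_quality: int         # 1-3, a summary of the five above
    justifications: dict[str, str] = field(default_factory=dict)  # dimension -> 1-2 sentences

rating = ResponseRating(3, 3, "Just Right", 2, 3, 3)
rating.justifications["writing_quality"] = "Readable but the section order buries the key result."
```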
Phase C — Justify (Critical)
Justify your score for each dimension with specific reasoning. Write a 1–2 sentence justification per dimension per response, drawing from the actual content of the response.
7. Ranking and Final Justification
This is the most critical step. Your ranking and justification are what the customer values most. Be thoughtful, nuanced, and draw on your domain expertise.
A. Rank — Head-to-Head Comparison
Compare the 2 model responses against each other and rate your preference for Response 1 or Response 2. You should not dock the model on LaTeX rendering issues unless you have confirmed in another tool such as Overleaf or QuickLaTeX that the response format is truly incorrect. If you have docked it in your Rating Dimensions, please go back and adjust it.
B. Justify — Write a Detailed Justification
Write a detailed justification explaining why you ranked the model responses in the order you did. Your justification should be 3–7 sentences and be comprehensive. Point out specific details in the response such as exact claims, examples, or omissions. Reference concrete strengths and weaknesses rather than vague statements. Mention every model response — your justification should address both models.
Justification Examples
Coding (Good): A strong justification considers real-world impact and weighs trade-offs contextually — for example, discussing how one model's function-naming choice has downstream impacts while another retains the current name and provides an intermediate debugging step, then weighing speed versus true debugging value.
Law (Good): A strong justification compares both responses on structure, completeness, and practical utility — for instance, discussing how one response structures a contract review as a prioritized checklist with risk severity ratings while the other covers the same points in impractical narrative form or misses key clauses entirely.
Medicine (Good): A strong justification identifies patient safety issues in both responses with clinical specificity — such as noting which response correctly flags a drug interaction warning consistent with guidelines while the other omits it or states an incorrect hold duration.
Finance (Good): A strong justification compares analytical depth and methodology rigor — for example, discussing Sharpe ratio comparisons across market regimes, whether responses adjust for volatility, and whether drawdown analysis is included.
Bad Examples (what not to do): Saying something like "This answered my question better because it was more thorough" says nothing specific. Saying "Response A seems wrong about some details, so I went with Response B" doesn't identify which details or explain why. Saying "I liked Response B better because the formatting was nicer" only addresses surface formatting with no substance comparison.
8. The Logical Flow — Consistency Across All Rating Types
You evaluate responses in three different ways: dimension scores (1–3), overall ranking (1st to 2nd), and pairwise comparison. These are not independent — they must follow a logical flow. Your Overall Quality score (the 6th dimension) should inform your preference ranking, which should inform your pairwise comparison.
Dimension Ratings (1–3 Scale). Rate each response independently on a 1–3 scale across 6 dimensions. Rate each response in isolation — don't compare to the other yet. A 3 means excellent; a 1 means it clearly falls short. Be consistent: similar performance should receive the same score. The 6th dimension, Overall Quality, should be reflective of the 5 dimensions before it. It's a summary score, not an independent opinion.
Overall Ranking (1st to 2nd). Rank the 2 responses from best to worst. Your Overall Quality scores should directly inform this ranking. If Response 1 got a 3 and Response 2 got a 1, Response 1 should rank higher. Use dimension scores as your guide, but weigh by what matters for the prompt. Don't rank a response first if it scored 1s across multiple dimensions.
Pairwise Comparison. You're deciding which of the 2 responses is better. We recommend writing out your preference ranking first, then using it as a reference when making the pairwise comparison. Your pairwise choice must be consistent with your overall ranking. For close calls, lean on the dimensions where they differ most.
Quick Consistency Check (run this before every submission):
- Does my Overall Quality score for each response align with the other 5 dimension scores?
- Does my preference ranking follow my Overall Quality scores?
- Does my pairwise comparison match my preference ranking?
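Complementing the earlier self-check sketch, a tiny illustrative helper makes the recommended ordering explicit: write the ranking first, then derive the pairwise pick from it.

```python
# Illustrative only: derive the pairwise pick from Overall Quality scores.
# A tie means the decision falls to the dimensions where the responses differ.
def pairwise_from_overall(overall_r1: int, overall_r2: int) -> str:
    if overall_r1 > overall_r2:
        return "Response 1"
    if overall_r2 > overall_r1:
        return "Response 2"
    return "tie: decide on the dimensions where the responses differ most"

print(pairwise_from_overall(3, 2))  # Response 1
```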
9. Key Principles
Correctness outweighs polish. A Truthfulness score of 1 matters more than high scores on every other dimension. A factually wrong response can never be preferred over a factually sound one simply because it reads better.
Overall Quality is holistic, not computed. It reflects a "would I use this as-is?" judgment informed by all dimensions, not an average or a formula.
Do not dock for LaTeX rendering. Unless you've confirmed in Overleaf or QuickLaTeX that the format is truly incorrect, rendering artifacts should not be penalized.
Domain expertise is your core value. Draw on it in every justification. Generic language that could apply to any task will be flagged.
10. Common Pitfalls — Learnings from Batch II
Pitfall 1 — Rating Contradictions. Your three types of ratings must tell a consistent story. The most common error is producing dimension scores, an overall ranking, and a pairwise comparison that contradict each other. For example, giving Response 1 higher Overall Quality but then preferring Response 2 in the head-to-head, or ranking a response first despite it scoring 1s across multiple dimensions.
Pitfall 2 — Incomplete Justification References. Your final justification must mention and compare both responses. A common mistake was writing a justification that only discussed one response. Each response must be mentioned by name or number, and you must explain why each earned its position.
Pitfall 3 — Generic Justifications. Vague, template-like justifications that could apply to any task are unacceptable. Your justifications must be specific to the prompt and the actual content of each response. Signs of a generic justification:
- LLM-style language like "well-written," "accurate," or "good quality" without specific evidence.
- Text that could be copy-pasted to a completely different task and still make sense.
- No references to specific content, facts, or details from the response.
A strong justification references specific facts, examples, or sections from the response, explains why something is good or bad in the context of the prompt, and draws on your domain expertise to evaluate correctness and depth.
11. Fellow Expectations and Policies
Quality Standards. Task quality is continuously monitored. If your average quality score falls to 3/5 or below on your last 3 tasks, you will face required remediation (additional training before continuing) and potential offboarding (removal if quality doesn't improve).
Zero Tolerance Policies (immediate offboarding with no remediation):
LLM Usage. Using AI tools to generate justifications or evaluation content is strictly prohibited.
Tasks Outside Your Domain. You must only claim tasks matching your assigned domain and expertise.
Community Guideline Violations. Violating HAI community guidelines results in immediate removal.
Summary. Maintain an average quality score above 3/5 on recent tasks. Never use LLM tools to complete any part of your tasks. Follow all HAI platform community guidelines. Reach out on Slack if you're struggling — the team would rather help than offboard.
12. FAQ
How many tasks should I aim for? As many as you can complete with quality during the timeline. Check your specific project track for goals.
Where do I ask questions? The #project-ohm Slack channels. Post your question there and someone will help you.
What if I'm stuck on a task? Post in Slack or bring it to a live support session.
Can I work on multiple verticals? Each fellow is assigned to one vertical. Contact your project lead if you'd like to discuss cross-vertical work.
How do I get paid? Pay is hourly, sent weekly on Wednesdays after 6 pm PT for the previous Mon–Sun period.
What's the time cap per task? 1 hour 50 minutes per task including revisions. Don't leave the timer running while idle.
13. Claim Sheet Tool
Project Ohm includes a Claim Sheet tool accessible from the sidebar of the training site. Enter your unique User ID (your Fellow ID from the HAI platform) to load your personalized claim sheet and view tasks routed to you. If you just filled out the assessment, wait 20 minutes before entering your User ID — the matching algorithm needs time to digest your information and build the task routing. There is also a linked video walkthrough if you are unsure how to find your User ID.
14. Quick-Reference Summary
Per task you will: read 1 prompt, review 2 model responses, score each on 6 dimensions (1–3 scale), write 1–2 sentence justifications per dimension, pick your preferred response in a head-to-head comparison, and write a 3–7 sentence final justification referencing both models.
Time constraints: 1 task at a time, 3-hour claim limit, 1 hour 50 minutes active work cap.
Scoring hierarchy: Truthfulness and Correctness matter most. Overall Quality is holistic — not an average. Dimension scores must logically flow into your preference choice.
Justification rules: Be specific, reference both models, cite concrete evidence, draw on domain expertise, never use generic or LLM-generated language.
Offboarding triggers: Quality below 3/5 on last 3 tasks (remediation first), LLM usage (immediate), out-of-domain claims (immediate), community violations (immediate).