Claude Code Mercor/Handshake Audit Session
Full Claude Code CLI session transcript performing a codebase audit of the Mercor and Handshake (Proctor) projects, designing a canonical format for LLM training/evaluation projects, and architecting a monorepo strategy for the Watts project.
Provenance
Ingested from C:\Users\mesha\Downloads\claude-code-mercor-handshake-audit-session.md on 2026-03-28. Original file ~172KB. This is a structured summary, not the full content.
Summary
Session Context
A Claude Code v2.1.83 session (Sonnet 4.6 with high effort) auditing two LLM platform contribution projects — Mercor (benchmark generation, Python) and Handshake/Proctor (task authoring + validation, TypeScript) — to find common patterns, design a canonical format, and prepare for a third project (Watts, response evaluation on Handshake AI).
Parallel Audit Results
Two subagents audited both repos simultaneously (~25 tool uses each, ~80K tokens):
| Layer | Mercor | Handshake/Proctor | Common Pattern |
|-------|--------|-------------------|----------------|
| Core artifact | Benchmark bundle (prompt, solution, rubric, hints, meta) | Task JSON (prompt, solution, rubric, hints, metadata) | Task with prompt + solution + rubric + hints + metadata |
| Rules | Failure mode registry + domain pack ABCs | proctor-rules.json (60+ rules, 11 sections) | Rules defining valid task structure |
| Templates | benchmarks/_template/ | templates/task-template.json | Scaffold for new task authoring |
| Validation | failsafe lint-benchmark (7 checks) | proctor-check validate (60+ rules, GO/REVIEW/NO-GO) | Linter/validator CLI |
| Eval | failsafe run + eval + rubric scorer | Session pipeline (collect + judge) | Eval pipeline against live models |
| Export | Mercor 5-field format | Direct task.json submission | Platform-specific export |
| Inventory | benchmarks/<domain>/<slug>/ | tasks/inventory.json + sessions | Catalog of task status |
| API layer | src/api/clients.py (4 providers) | src/api/ (5 providers) | Multi-provider LLM client |
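The common artifact in the table above could be captured as a single shared type. A minimal sketch in Python, using field names from the audit's common-pattern column (the concrete platform formats differ in naming and nesting):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Canonical task artifact shared by Mercor and Handshake/Proctor.

    Field names follow the common-pattern column of the audit table;
    each platform's exporter maps this onto its own format.
    """
    prompt: str
    solution: str
    rubric: list[str]
    hints: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

task = Task(
    prompt="Implement a token-bucket rate limiter.",
    solution="...",
    rubric=["Correct algorithm", "Handles edge cases"],
)
```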
Key Insight
Both projects independently discovered the same split: machine-readable source of truth (proctor-rules.json, failure_modes.py) paired with human-readable guides (CLAUDE.md, DAILY_TEMPLATE.md). This validated the architecture for any project with both human contributors and automated validators.
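One way to keep the machine-readable file authoritative is to render the human-readable guide from it, so the two can never drift. A hypothetical sketch below; the rule entries are invented for illustration and do not reflect the actual structure of proctor-rules.json or failure_modes.py:

```python
import json

# Hypothetical rule entries; the real proctor-rules.json layout may differ.
rules_json = '''
{
  "rules": [
    {"id": "R001", "section": "prompt", "check": "non_empty", "severity": "NO-GO"},
    {"id": "R002", "section": "rubric", "check": "min_items:3", "severity": "REVIEW"}
  ]
}
'''

rules = json.loads(rules_json)["rules"]

# Render a guide fragment from the same source of truth, so a
# CLAUDE.md-style document never disagrees with the validator.
guide_lines = [
    f"- [{r['severity']}] {r['id']} ({r['section']}): {r['check']}" for r in rules
]
print("\n".join(guide_lines))
```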
Watts Project Context
- Platform: Handshake AI (HAI)
- Pay rate: $75/hr
- Workflow: Receive prompt + 6 model responses → rate across 6 dimensions (Instruction Following, Truthfulness, Helpfulness, Harmlessness, Verbosity, Overall Quality) → rank → justify → submit
- Fundamentally different from Proctor: Proctor creates/validates tasks; Watts evaluates/ranks responses (RLHF preference annotation pipeline)
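The per-response rating step of this workflow could be modeled as below. The record shape is a guess at what the Watts artifact stores, not the actual platform format:

```python
# The six Watts rating dimensions, in submission order.
DIMENSIONS = [
    "Instruction Following", "Truthfulness", "Helpfulness",
    "Harmlessness", "Verbosity", "Overall Quality",
]

def rate_response(response_id: str, scores: list[int]) -> dict:
    """Pair one score per dimension; scores use the 1-5 scale."""
    if len(scores) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} scores, got {len(scores)}")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("all scores must be in 1-5")
    return {"id": response_id, "ratings": dict(zip(DIMENSIONS, scores))}

record = rate_response("response-3", [5, 4, 4, 5, 3, 4])
```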
Three Architecture Options Evaluated
- Option A — Convention-Only: 3 independent repos with shared layout spec (no coupling, no code reuse)
- Option B — Handshake HAI Monorepo + Mercor Separate (RECOMMENDED): Proctor and Watts in a single npm workspace monorepo sharing ~40% of codebase (core types, inventory, sessions, API clients, reporters); Mercor stays as separate Python repo
- Option C — Shared Template Generator: Starter repo scaffolding new projects (easy bootstrap but no propagation of improvements)
Canonical Layout Spec (Shared Convention)
Every LLM platform project must have:
```
<project>/
├── CLAUDE.md
├── AGENTS.md
├── rules/ (rules.json + guides/)
├── schema/ (artifact schema + templates/)
├── tasks/ (inventory.json + examples/ + mine/ + sessions/)
├── src/
├── tests/
└── docs/
```
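A conformance check against this layout can be very small. The sketch below takes its path names from the spec above; the checker itself is hypothetical, not part of either project:

```python
from pathlib import Path

# Entries every LLM platform project must have, per the canonical layout spec.
REQUIRED = [
    "CLAUDE.md", "AGENTS.md", "rules", "schema",
    "tasks/inventory.json", "src", "tests", "docs",
]

def check_layout(root: str) -> list[str]:
    """Return the required entries missing from a project directory."""
    base = Path(root)
    return [entry for entry in REQUIRED if not (base / entry).exists()]
```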
Handshake HAI Monorepo Structure
```
_handshake-hai/
├── packages/
│   ├── core/ (@hai/core — shared types, inventory, sessions, API clients, reporters)
│   ├── proctor/ (@hai/proctor — task authoring validator, 11 sections)
│   └── watts/ (@hai/watts — response rater/ranker, 6 dimensions)
└── docs/
```
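The npm workspace wiring for this structure might look like the following root package.json. The @hai scope and package directories come from the audit; everything else here is an illustrative guess:

```json
{
  "name": "handshake-hai",
  "private": true,
  "workspaces": [
    "packages/core",
    "packages/proctor",
    "packages/watts"
  ]
}
```

With this in place, `npm install` at the root links @hai/core into the proctor and watts packages so shared types and clients propagate automatically.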
Watts Evaluation Schema
Core artifact is evaluation.json, containing: prompt, domain, difficulty, six responses each rated across the six dimensions (1-5 scale), a final ranking array, and justification text. Validators check: rating-range compliance, ranking-array consistency, Truthfulness=1 forcing rank-last, no responses tied across all dimensions, justification length/specificity, and domain match.
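Most of these validator checks can be sketched directly. The evaluation.json field names below (responses, ratings, ranking, justification) are assumptions about the schema rather than the confirmed format, and the domain-match check is omitted because it needs task context:

```python
def validate_evaluation(ev: dict) -> list[str]:
    """Return a list of violations; an empty list means the evaluation passes.

    Assumed shape: ev["responses"] is a list of {"id": str,
    "ratings": {dimension: 1-5}}; ev["ranking"] lists response ids best-first.
    """
    errors = []
    responses = ev["responses"]
    ranking = ev["ranking"]

    # Rating range compliance: every dimension scored 1-5.
    for r in responses:
        for dim, score in r["ratings"].items():
            if not 1 <= score <= 5:
                errors.append(f"{r['id']}: {dim}={score} outside 1-5")

    # Ranking array consistency: exactly one slot per response.
    if sorted(ranking) != sorted(r["id"] for r in responses):
        errors.append("ranking is not a permutation of response ids")

    # Truthfulness=1 forces rank-last: untruthful responses must
    # occupy the final slots of the ranking.
    untruthful = {r["id"] for r in responses if r["ratings"].get("Truthfulness") == 1}
    if untruthful and not untruthful <= set(ranking[-len(untruthful):]):
        errors.append("a Truthfulness=1 response is not ranked last")

    # No two responses tied across all dimensions.
    seen = {}
    for r in responses:
        key = tuple(sorted(r["ratings"].items()))
        if key in seen:
            errors.append(f"{seen[key]} and {r['id']} tied on every dimension")
        seen[key] = r["id"]

    # Justification length (specificity would need extra heuristics).
    if len(ev.get("justification", "")) < 50:
        errors.append("justification shorter than 50 characters")

    return errors
```

The rank-last rule generalizes to several untruthful responses by requiring them all to sit in the final slots of the ranking.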
Outputs
- Complete canonical layout specification
- Monorepo architecture for Handshake HAI projects
- Watts evaluation schema and validator design
- Migration plan from existing Proctor repo
- Mercor alignment recommendations