Claude Code Mercor/Handshake Audit Session
Full Claude Code CLI session transcript performing a codebase audit of the Mercor and Handshake (Proctor) projects, designing a canonical format for LLM training/evaluation projects, and architecting a monorepo strategy for the Watts project.
Provenance
Ingested from C:\Users\mesha\Downloads\claude-code-mercor-handshake-audit-session.md on 2026-03-28. Original file ~172KB. This is a structured summary, not the full content.
Summary
Session Context
A Claude Code v2.1.83 session (Sonnet 4.6 with high effort) auditing two LLM platform contribution projects — Mercor (benchmark generation, Python) and Handshake/Proctor (task authoring + validation, TypeScript) — to find common patterns, design a canonical format, and prepare for a third project (Watts, response evaluation on Handshake AI).
Parallel Audit Results
Two subagents audited both repos simultaneously (~25 tool uses each, ~80K tokens):
| Layer | Mercor | Handshake/Proctor | Common Pattern |
|-------|--------|-------------------|----------------|
| Core artifact | Benchmark bundle (prompt, solution, rubric, hints, meta) | Task JSON (prompt, solution, rubric, hints, metadata) | Task with prompt + solution + rubric + hints + metadata |
| Rules | Failure mode registry + domain pack ABCs | proctor-rules.json (60+ rules, 11 sections) | Rules defining valid task structure |
| Templates | benchmarks/_template/ | templates/task-template.json | Scaffold for new task authoring |
| Validation | failsafe lint-benchmark (7 checks) | proctor-check validate (60+ rules, GO/REVIEW/NO-GO) | Linter/validator CLI |
| Eval | failsafe run + eval + rubric scorer | Session pipeline (collect + judge) | Eval pipeline against live models |
| Export | Mercor 5-field format | Direct task.json submission | Platform-specific export |
| Inventory | benchmarks/<domain>/<slug>/ | tasks/inventory.json + sessions | Catalog of task status |
| API layer | src/api/clients.py (4 providers) | src/api/ (5 providers) | Multi-provider LLM client |
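The common artifact in the table above could be captured as a single shared type. A minimal sketch in Python, using field names from the audit's common-pattern column (the concrete platform formats differ in naming and nesting):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Canonical task artifact shared by Mercor and Handshake/Proctor.

    Field names follow the common-pattern column of the audit table;
    each platform's exporter maps this onto its own format.
    """
    prompt: str
    solution: str
    rubric: list[str]
    hints: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

task = Task(
    prompt="Implement a token-bucket rate limiter.",
    solution="...",
    rubric=["Correct algorithm", "Handles edge cases"],
)
```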
Key Insight
Both projects independently discovered the same split: machine-readable source of truth (proctor-rules.json, failure_modes.py) paired with human-readable guides (CLAUDE.md, DAILY_TEMPLATE.md). This validated the architecture for any project with both human contributors and automated validators.
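One way to keep the machine-readable file authoritative is to render the human-readable guide from it, so the two can never drift. A hypothetical sketch below; the rule entries are invented for illustration and do not reflect the actual structure of proctor-rules.json or failure_modes.py:

```python
import json

# Hypothetical rule entries; the real proctor-rules.json layout may differ.
rules_json = '''
{
  "rules": [
    {"id": "R001", "section": "prompt", "check": "non_empty", "severity": "NO-GO"},
    {"id": "R002", "section": "rubric", "check": "min_items:3", "severity": "REVIEW"}
  ]
}
'''

rules = json.loads(rules_json)["rules"]

# Render a guide fragment from the same source of truth, so a
# CLAUDE.md-style document never disagrees with the validator.
guide_lines = [
    f"- [{r['severity']}] {r['id']} ({r['section']}): {r['check']}" for r in rules
]
print("\n".join(guide_lines))
```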
Watts Project Context
- Platform: Handshake AI (HAI)
- Pay rate: $75/hr
- Workflow: Receive prompt + 6 model responses → rate across 6 dimensions (Instruction Following, Truthfulness, Helpfulness, Harmlessness, Verbosity, Overall Quality) → rank → justify → submit
- Fundamentally different from Proctor: Proctor creates/validates tasks; Watts evaluates/ranks responses (RLHF preference annotation pipeline)
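The per-response rating step of this workflow could be modeled as below. The record shape is a guess at what the Watts artifact stores, not the actual platform format:

```python
# The six Watts rating dimensions, in submission order.
DIMENSIONS = [
    "Instruction Following", "Truthfulness", "Helpfulness",
    "Harmlessness", "Verbosity", "Overall Quality",
]

def rate_response(response_id: str, scores: list[int]) -> dict:
    """Pair one score per dimension; scores use the 1-5 scale."""
    if len(scores) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} scores, got {len(scores)}")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("all scores must be in 1-5")
    return {"id": response_id, "ratings": dict(zip(DIMENSIONS, scores))}

record = rate_response("response-3", [5, 4, 4, 5, 3, 4])
```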
Three Architecture Options Evaluated
- Option A — Convention-Only: 3 independent repos with shared layout spec (no coupling, no code reuse)
- Option B — Handshake HAI Monorepo + Mercor Separate (RECOMMENDED): Proctor and Watts in a single npm workspace monorepo sharing ~40% of codebase (core types, inventory, sessions, API clients, reporters); Mercor stays as separate Python repo
- Option C — Shared Template Generator: Starter repo scaffolding new projects (easy bootstrap but no propagation of improvements)
Canonical Layout Spec (Shared Convention)
Every LLM platform project must have:
```
<project>/
├── CLAUDE.md
├── AGENTS.md
├── rules/ (rules.json + guides/)
├── schema/ (artifact schema + templates/)
├── tasks/ (inventory.json + examples/ + mine/ + sessions/)
├── src/
├── tests/
└── docs/
```
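A conformance check against this layout can be very small. The sketch below takes its path names from the spec above; the checker itself is hypothetical, not part of either project:

```python
from pathlib import Path

# Entries every LLM platform project must have, per the canonical layout spec.
REQUIRED = [
    "CLAUDE.md", "AGENTS.md", "rules", "schema",
    "tasks/inventory.json", "src", "tests", "docs",
]

def check_layout(root: str) -> list[str]:
    """Return the required entries missing from a project directory."""
    base = Path(root)
    return [entry for entry in REQUIRED if not (base / entry).exists()]
```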
Handshake HAI Monorepo Structure
```
_handshake-hai/
├── packages/
│   ├── core/ (@hai/core — shared types, inventory, sessions, API clients, reporters)
│   ├── proctor/ (@hai/proctor — task authoring validator, 11 sections)
│   └── watts/ (@hai/watts — response rater/ranker, 6 dimensions)
└── docs/
```
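The npm workspace wiring for this structure might look like the following root package.json. The @hai scope and package directories come from the audit; everything else here is an illustrative guess:

```json
{
  "name": "handshake-hai",
  "private": true,
  "workspaces": [
    "packages/core",
    "packages/proctor",
    "packages/watts"
  ]
}
```

With this in place, `npm install` at the root links @hai/core into the proctor and watts packages so shared types and clients propagate automatically.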
Watts Evaluation Schema
Core artifact is evaluation.json, containing: prompt, domain, difficulty, six responses each rated across the six dimensions (1-5 scale), a final ranking array, and justification text. Validators check: rating-range compliance, ranking-array consistency, Truthfulness=1 forcing rank-last, no responses tied across all dimensions, justification length/specificity, and domain match.
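Most of these validator checks can be sketched directly. The evaluation.json field names below (responses, ratings, ranking, justification) are assumptions about the schema rather than the confirmed format, and the domain-match check is omitted because it needs task context:

```python
def validate_evaluation(ev: dict) -> list[str]:
    """Return a list of violations; an empty list means the evaluation passes.

    Assumed shape: ev["responses"] is a list of {"id": str,
    "ratings": {dimension: 1-5}}; ev["ranking"] lists response ids best-first.
    """
    errors = []
    responses = ev["responses"]
    ranking = ev["ranking"]

    # Rating range compliance: every dimension scored 1-5.
    for r in responses:
        for dim, score in r["ratings"].items():
            if not 1 <= score <= 5:
                errors.append(f"{r['id']}: {dim}={score} outside 1-5")

    # Ranking array consistency: exactly one slot per response.
    if sorted(ranking) != sorted(r["id"] for r in responses):
        errors.append("ranking is not a permutation of response ids")

    # Truthfulness=1 forces rank-last: untruthful responses must
    # occupy the final slots of the ranking.
    untruthful = {r["id"] for r in responses if r["ratings"].get("Truthfulness") == 1}
    if untruthful and not untruthful <= set(ranking[-len(untruthful):]):
        errors.append("a Truthfulness=1 response is not ranked last")

    # No two responses tied across all dimensions.
    seen = {}
    for r in responses:
        key = tuple(sorted(r["ratings"].items()))
        if key in seen:
            errors.append(f"{seen[key]} and {r['id']} tied on every dimension")
        seen[key] = r["id"]

    # Justification length (specificity would need extra heuristics).
    if len(ev.get("justification", "")) < 50:
        errors.append("justification shorter than 50 characters")

    return errors
```

The rank-last rule generalizes to several untruthful responses by requiring them all to sit in the final slots of the ranking.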
Outputs
- Complete canonical layout specification
- Monorepo architecture for Handshake HAI projects
- Watts evaluation schema and validator design
- Migration plan from existing Proctor repo
- Mercor alignment recommendations