Claude Code Mercor Handshake Audit Session



Source: claude-code-mercor-handshake-audit-session.md (ingested 2026-03-28)

Summary of a 135,510-character source file.

Windows PowerShell Copyright (C) Microsoft Corporation. All rights reserved.

Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows

PS C:\Users\mesha\Desktop\GitHub\alawein> claude --dangerously-skip-permissions

  Claude Code v2.1.83
  Sonnet 4.6 with high effort · Claude Max
  ~\Desktop\GitHub\alawein

↑ Opus now defaults to 1M context · 5x more room, same pricing

/model ⎿ Set model to Sonnet 4.6 with high effort

> can you codebase audit the mercor and handshake projects? I would like to put both in a canonical format (they are both LLM training/eval/etc. projects on platforms, and hence have common themes of having requirements, rules, prompts, templates, evals, rubrics, etc.). Let us find the common theme and make sure we come up with the most logical and extensible format/layout/repo structure and a minimal set of files, workflows, and inventories such that I can develop a third project from handshake (called watts; I can provide details later). Use superpowers writing plans and your best agents/subagents to deploy things and make sure all is meaningful and correct.
● Skill(superpowers:brainstorming)
  ⎿  Successfully loaded skill

● Now let me audit both repos in parallel.

● 2 Explore agents finished (ctrl+o to expand)
  ├─ Audit mercor LLM project · 25 tool uses · 78.1k tokens
  │   ⎿  Done
  └─ Audit handshake proctor project · 25 tool uses · 86.1k tokens
      ⎿  Done

● Both audits complete. Here's a high-level synthesis before the first question:


What I found — shared DNA across both projects:

| Layer | mercor | handshake | Common pattern |
|---|---|---|---|
| Core artifact | benchmark bundle (prompt.md, solution.py, rubric.md, hints.md, analysis.md, meta.json) | task JSON (prompt, solution, rubric, hints, metadata) | task with prompt + solution + rubric + hints + metadata |
| Rules/Requirements | failure mode registry + domain pack ABCs | proctor-rules.json (60+ rules, 11 sections) | rules that define what a valid task looks like |
| Templates | benchmarks/template/ | templates/task-template.json | scaffold for new task authoring |
| Validation/QA | failsafe lint-benchmark (7 checks) | proctor-check validate (60+ rules, GO/REVIEW/NO-GO) | linter/validator CLI |
| Eval | failsafe run + failsafe eval + rubric scorer | session pipeline (collect + judge) | eval pipeline against live models |
| Export/Submit | Mercor 5-field format (mercor/field*.md) | direct task.json submission | platform-specific export |
| Inventory | benchmarks/// layout | tasks/inventory.json + sessions | catalog of task status |
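The common pattern in the table (every task carries prompt + solution + rubric + hints + metadata, checked by a linter/validator CLI) can be sketched as a minimal validator. This is an illustrative sketch only: the field names and the `validate_task` helper are hypothetical, not the actual mercor or handshake schemas or tooling.

```python
# Hypothetical canonical-task validator sketch.
# REQUIRED_FIELDS mirrors the "Common pattern" column of the audit table;
# the exact key names used by mercor/handshake may differ.
REQUIRED_FIELDS = {"prompt", "solution", "rubric", "hints", "metadata"}

def validate_task(task: dict) -> list[str]:
    """Return a list of problems; an empty list means the task passes."""
    # Missing fields first, in a stable order.
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - task.keys())]
    # Present-but-empty fields also fail, matching the linter-style checks
    # both projects already run.
    for field in sorted(REQUIRED_FIELDS & task.keys()):
        if not task[field]:
            problems.append(f"empty field: {field}")
    return problems
```

A GO/REVIEW/NO-GO style verdict (as in proctor-check) would then be a thin wrapper that maps an empty problem list to GO and a non-empty one to REVIEW or NO-GO by severity.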

[... truncated for PKOS record ...]