In this module
- Install the harness — one curl line, 30 seconds (callout above)
- Align on what to build — a five-minute conversation with Claude, no formality
- Run five slash commands in order: /harness-prep → /harness-discovery → /harness-plan → /gauntlet-review → /harness-coordinator
- Close out: /goal Complete Harness and e2e testing
That's the whole flow. The rest of this page is appendix material for the curious — what each skill does internally, a worked example, failure modes we've actually hit, and the inner-mechanics diagram. Skip unless you're building your own harness skills or debugging one that's stuck.
What's actually in a harness
A working definition shared across the industry: Agent = Model + Harness. The model is what you'd download from a vendor; the harness is everything around it — the code, configuration, tools, state, and feedback loops that turn a chat completion into a useful agent.
Each ring below encodes an assumption about what the model alone can't reliably do. As models improve, the assumptions shift — but the rings don't go away.
After Osmani's harness-anatomy framing (2026). See Further reading.
PM33's harness implementation makes specific choices at each ring. Context comes from CLAUDE.md + skill files + per-agent indices. Control flow is the five slash commands you ran above. Action is the standard tool set plus PM33 MCP. Persistence is the per-agent worktree convention + progress.json + audit log. Observation is the gauntlet review + machine-verifiable acceptance criteria + outcome attribution.
Learning objectives
After this module you should be able to:
- Decide when a harness is the right tool (16+ hours, multi-session, 3+ phases)
- Install the harness in 30 seconds
- Take a feature from "idea" to "shipped" using five slash commands
- Verify and close out a finished harness with /goal
When to use a harness
A harness is overkill for a single bug fix. It's exactly the right tool for a multi-phase feature, an AI integration, a security overhaul, or any project that will run across multiple Claude Code sessions and multiple specialists. Decision rule:
| Use a harness | Don't bother |
|---|---|
| 16+ hours of work | <8-hour single-session feature |
| 3+ logical phases | One-off bug fix |
| Multiple specialties (backend + frontend + DB) | Doc-only changes |
| Architectural decisions worth recording | Unclear requirements (do brainstorming first) |
| High blast radius (schema, auth, core services) | Throwaway prototype |
When in doubt: skip the harness on the first session. If you find yourself losing context between sessions or struggling to coordinate multiple specialists, spin one up — the overhead amortizes quickly for multi-week work.
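The two numeric thresholds in the table can be sketched as a small shell check. This is only an illustration: the function name `should_use_harness` is hypothetical, it treats "16+ hours" and "3+ phases" as jointly required (an assumption), and it ignores the non-numeric criteria such as specialties and blast radius.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the harness tooling.
# Usage: should_use_harness <estimated_hours> <phase_count>
# Prints "harness" when both numeric thresholds from the table are met,
# and "single-brief" otherwise.
should_use_harness() {
  local hours="$1" phases="$2"
  if (( hours >= 16 && phases >= 3 )); then
    echo "harness"
  else
    echo "single-brief"
  fi
}

should_use_harness 20 4   # prints "harness"
should_use_harness 6 1    # prints "single-brief"
```

Work that falls between the two columns (for example 12 hours across 3 phases) still needs a judgment call; the sketch simply defaults it to a single Brief.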
Step 1 — Align on what to build
Before any slash command, have a normal conversation with Claude in chat. Describe:
- The feature — what you're trying to build
- The constraint — schedule, scope, dependencies
- The outcome — how you'll know it worked (a metric, a behavior, a customer signal)
No template required. Brief Claude the way you'd brief a senior PM friend over coffee. Five minutes is plenty. The slash commands take it from there.
Step 2 — Run the five slash commands
Each picks up where the last left off. You can stop between any two and resume later — the harness state persists.
| # | Command | What it does | Time |
|---|---|---|---|
| 1 | /harness-prep | Phase 0 orchestrator. Sequences discovery (internal codebase audit) + brainstorming (2-4 framings) + conditional research (1+3 search budget) into one enriched discovery doc | 5-15 min |
| 2 | /harness-discovery | Invoked inside /harness-prep — broken out here for transparency. Audits existing code, names what's reusable, calls out load-bearing pre-existing bugs | (within step 1) |
| 3 | /harness-plan | Decomposes the prepared doc into phased Briefs with TDD plans, agent assignments, and acceptance criteria. Files the harness skeleton under docs/frameworks/agent-state-{HARNESS-ID}/ | 10-20 min |
| 4 | /gauntlet-review | Dispatches 3 specialist reviewers (security-auditor + backend-architect + test-automator, by default) in parallel against the spec. Surfaces blockers + integrates findings to Appendix A | 10-15 min |
| 5 | /harness-coordinator | Execute. Orchestrates specialists, dispatches them to per-agent worktrees, monitors progress, handles failure modes. Prompts you only when human judgment is needed | hours to days, autonomous |
You don't need to know what's happening inside any of these. The coordinator prompts you when human judgment is required (approve a tradeoff, pick between two paths, resolve a conflict). Otherwise it runs autonomously across multiple sessions.
Step 3 — Close out
When the coordinator reports the harness complete, run:
/goal Complete Harness and e2e testing
This verifies every phase landed cleanly, runs the full e2e suite, files any residual tech-debt items as PM33 work items via MCP, and marks the harness "delivered" — closing the audit-log loop. The harness directory stays in docs/frameworks/agent-state-{HARNESS-ID}/ as the permanent record.
That's the entire workflow: an opening conversation, five commands, and one closing command.
Deeper dive — for the curious
The rest of this page explains what each skill does internally, walks through a real example (PAM-CONTEXT-BUDGET-001), names the failure modes we've actually hit, and shows the inner-mechanics diagram. Skip unless you want to build your own harness skills or are debugging one that's stuck.
The directory layout
Each harness lives at docs/frameworks/agent-state-{HARNESS-ID}/:
docs/frameworks/agent-state-{HARNESS-ID}/
├── README.md ← project summary, phase breakdown, success criteria
├── init.sh ← env validation (run at session start)
├── pm33-agent-progress.json ← feature list with status (the source of truth)
├── claude-progress.txt ← append-only session log
├── decisions/ ← ADRs for architectural choices made mid-harness
└── issues/ ← incident notes for things discovered mid-implementation
/harness-plan creates this structure for you. You almost never edit these files by hand — the slash commands maintain them.
The coordinator + specialists pattern
The harness separates two roles structurally:
- Coordinator: orchestrates, never implements. Reads progress.json, picks the next feature, dispatches a specialist agent in a per-agent worktree, merges back when done. The coordinator is YOU running the /harness-coordinator skill (or a Claude Code session that's been told "you're the coordinator").
- Specialists: implement, never coordinate. Each one gets a fresh worktree, a clear Brief, and the harness-discipline skill. Specialists are dispatched via the Task tool, run RED → GREEN → REFACTOR → DELIVERY, commit + push from their own worktree, and report back with evidence.
Critical: specialists never talk to each other. They each get a Brief, a worktree, and a clear scope. The coordinator is the only thing that holds the cross-specialist context. This is the structural fix for the ABSORPTION class of bugs (where one agent's commit accidentally absorbs another's work).
The harness skills workflow (and why each is load-bearing)
Companion diagram: ../diagrams/03b-harness-skills-workflow.excalidraw
PM33's harness infrastructure is a workflow, not a tool. The workflow is implemented as 5 distinct Claude Code skills, each invoked at a specific moment:
Phase 0                         Phases 1-3    Phase 4             Execute
─────────                       ──────────    ─────────           ────────────────────
harness-prep                 →  planner    →  gauntlet-review  →  coordinator
├─ discovery (always)                         (integrates into         ↓
├─ brainstorming (always)                      Appendix A)        discipline
└─ research (conditional)                                         (per specialist)
Each skill has a specific job. Skipping any one of them is how harnesses fail. They are NOT replacements for each other — they compose. The system is designed assuming all five run.
Recent change (2026-05-29): harness-prep is the new Phase 0 entry. Previously you'd load harness-discovery directly. Now you load harness-prep, which orchestrates discovery + a brainstorming pass + a conditional research pass. The output is an enriched discovery doc that gives the planner a much better starting point.
Why harnesses matter (the value prop)
Before going skill-by-skill: why is this elaborate workflow worth it?
PM33 was the first product team to systematically use multi-agent AI development at scale (5+ parallel specialists, 100+ shipped Briefs per quarter). We learned the hard way that AI agents fail in different ways than humans do:
- Human-only software dev has a natural rate-limiter — humans get tired, take breaks, hand off context. The system tolerates ad-hoc workflows.
- AI-driven software dev is a fire hose. Agents work in parallel at 50ms granularity. The same patterns that worked for 1-3 humans break down at 5+ agents. Cross-agent commit absorption (ABSORPTION-002), mocked-out integration tests that hide real bugs (OUTCOMES-001), context drift across multi-session work, specification ambiguity exploding into 10x rework — all of these are AI-era failure modes.
The harness skills exist because we hit each of these failure modes in production and structurally fixed them. Each skill is a checkpoint that catches a specific class of failure BEFORE it cascades.
The cost of running the full workflow on a 20-hour project: ~30 minutes (discovery + planning + gauntlet upfront, coordinator overhead during execution). The cost of NOT running it: 5-15 hours of mid-stream rework, plus the failure modes documented in TECHNICAL_DEBT.md.
The math gets better the larger the project. For a 100-hour cross-team initiative, the harness workflow saves multiples of its overhead.
Skill 1 — harness-prep (Phase 0 entry point)
When to invoke: At the START of every new harness, before drafting the plan. Phase 0.
Load: Skill({ skill: "harness-prep" })
What it does: Orchestrates three sub-skills to produce an enriched discovery doc that the planner consumes:
harness-prep
├─ invokes harness-discovery (always — internal infra audit)
├─ invokes superpowers:brainstorming (always — 2-4 problem framings)
└─ invokes harness-research (conditional — strict 1+3 budget)
↓ produces enriched discovery doc with up to 3 sections
Sub-skill: harness-discovery (always invoked)
Internal audit of the existing codebase. Surfaces:
- Existing services the planned work can reuse (vs. reinvent)
- Schema patterns the new code should follow (vs. fight)
- MCP tools already available (vs. building new ones)
- Tech debt that overlaps with the planned scope (so the harness can address it deliberately or carve it out of scope)
- Active harnesses that touch adjacent code (to coordinate timing)
- Related ADRs already decided (to avoid relitigating)
Sub-skill: superpowers:brainstorming (always invoked)
Generates 2-4 alternative framings of the problem before committing to one. Prevents the "first-idea lock-in" failure mode where the planner anchors on the requester's initial framing without considering whether a different framing would dissolve the problem entirely. Output: framing alternatives ranked by trade-off.
Sub-skill: harness-research (conditionally invoked)
External research (web, papers, vendor docs) only when one of the trigger conditions fires:
- Evaluating a new library/SDK not already in package.json
- Feature touches a fast-moving domain: AI API patterns, security advisories, compliance (SOC2/GDPR/HIPAA), billing rules
- User explicitly asks for outside-codebase context
- Planner cannot write a credible spec without external API knowledge
- Discovery flags a gap as "unclear without external context"
Strict budget: 1 search-specialist agent + 3 WebSearch calls (default). Hard ceiling: 2 agents + 6 WebSearch calls. Findings are decay-tagged (research date, sources, confidence) so they don't get cargo-culted into future plans without re-verification.
Default is no research. If you cannot name the specific scoped question you need answered, skip this step.
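The default budget and hard ceiling can be sketched as a simple classifier. The function name `classify_research_usage` and its three labels are hypothetical; only the numbers (1 agent + 3 searches by default, 2 agents + 6 searches at most) come from the text.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify research usage against the stated budget.
# Usage: classify_research_usage <agents_used> <websearch_calls>
classify_research_usage() {
  local agents="$1" searches="$2"
  if (( agents <= 1 && searches <= 3 )); then
    echo "within-default"    # 1 search-specialist agent + 3 WebSearch calls
  elif (( agents <= 2 && searches <= 6 )); then
    echo "within-ceiling"    # hard ceiling: 2 agents + 6 WebSearch calls
  else
    echo "over-budget"
  fi
}

classify_research_usage 1 3   # prints "within-default"
classify_research_usage 2 5   # prints "within-ceiling"
classify_research_usage 3 1   # prints "over-budget"
```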
Why prep is load-bearing
Without prep, every new harness plan gets drafted in a vacuum. Three failure modes:
- Greenfield assumption — planner assumes capabilities must be built when they actually already exist. Six features in, the specialist discovers 80% of feature 3 already exists as a service. Hours of rework.
- First-idea lock-in — planner commits to the requester's initial framing without considering alternatives. The "right" framing was a quick configuration change; the implemented framing was a 40-hour rewrite.
- Missed domain knowledge — planner doesn't know what the field already learned. Reimplements a known anti-pattern. Gauntlet might catch it; might not.
Output artifact: docs/dogfood/discovery/<slug>.md with up to 3 named sections:
| Section | Source | Always present? |
|---|---|---|
| ## Findings by area | harness-discovery | Yes |
| ## Alternatives considered | superpowers:brainstorming | Yes (unless skip-with-rationale) |
| ## External research | harness-research | Conditional: only when trigger fires |
This doc is the handoff artifact. harness-planner treats it as required input. A plan drafted without this doc is considered incomplete.
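A planner-side completeness check for this handoff doc could look like the sketch below. It is an assumption that the section headings appear verbatim as whole lines; the function name `check_discovery_doc` is hypothetical, and a documented skip-with-rationale for the alternatives section would need separate handling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify the always-present sections exist in a
# discovery doc. Returns 0 when both are found, 1 otherwise.
# Usage: check_discovery_doc <path-to-doc>
check_discovery_doc() {
  local doc="$1" missing=0 heading
  for heading in "## Findings by area" "## Alternatives considered"; do
    # -x: match the whole line, -F: treat the heading as a fixed string
    if ! grep -qxF -- "$heading" "$doc"; then
      echo "missing section: $heading" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The conditional `## External research` section is deliberately not checked, since its absence is the default.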
Example output: For a "Pam context budget" harness, prep produces:
- Findings by area: existing TokenizerService can be reused; StructuredLogger already has PII redaction hooks; MCP-CONTEXT-BUDGET-001 was previously filed as tech debt, meaning this harness is now its resolution
- Alternatives considered: 3 framings ("compress the prompt" vs. "selective context loading" vs. "structured summarization"). Trade-offs ranked. Selective context loading wins on cost/risk.
- External research (conditional): scoped to "recent Anthropic guidance on long-context handling," 2 sections added with decay tags
Planning starts informed, anchored on the right framing, with field knowledge factored in.
Critical anti-pattern: do NOT invoke harness-discovery, superpowers:brainstorming, or harness-research individually when prepping a harness. Invoke harness-prep instead — it handles the sequencing, ensures sections are appended in the right order, and produces a single coherent handoff doc.
Skip conditions for prep itself (document the skip in the plan's Appendix B with rationale):
- Pure compliance / correctness fix with one correct answer (e.g., CVE remediation, schema drift fix)
- Truly greenfield work (new file in a new directory with zero neighbors)
- Re-running a prep that already completed in the same session
Skill 2 — harness-planner
When to invoke: After discovery, when ready to scaffold the harness directory. Phase 1.
Load: Skill({ skill: "harness-planner" })
What it does: Decomposes 16+ hour work into phases and Briefs. Produces:
- The docs/frameworks/agent-state-{HARNESS-ID}/ directory
- README.md: problem statement, phase breakdown (3-7 phases), success criteria, out-of-scope (explicit)
- pm33-agent-progress.json: the canonical state file with phases[] containing features[] with status + git_commit fields
- init.sh: env validation script (run at every session start)
- Specialist + LLM tier assignments per feature (using the harness-planner specialist matrix)
- Required skills per Brief
- Estimated effort per phase
Why it's load-bearing: Without a plan that lives in progress.json, there is no shared state across sessions. The coordinator can't know what's done. Specialists lose context across handoffs. The harness degenerates into ad-hoc work that any coordinator could approximate with find + git log + memory — which is to say, badly, and not reliably.
The plan is also the input to the next skill (gauntlet-review). No plan, no review.
Example output: For "Pam context budget" harness, planning produces a 4-phase, 16-feature plan with specialist assignments (4× backend-architect, 6× test-automator, 3× frontend-developer, 3× database-admin), tier assignments (8× sonnet, 6× haiku, 2× opus), and a progress.json that becomes the source of truth for the next 20 hours of work.
Skill 3 — gauntlet-review
When to invoke: After planning, BEFORE any Brief is dispatched to a specialist. Phase 1 gate.
Load: Skill({ skill: "gauntlet-review" })
What it does: Runs an adversarial multi-specialist critique of the plan. Each specialist class reads the plan from their domain perspective:
- architect: is the scope right? Are the phases ordered correctly? Will the dependencies actually flow?
- DBA: what's the schema impact? Migrations needed? Tenant isolation considered? Schema parity addressed?
- security: what bypass paths does this open? Auth/scope changes? Audit log implications?
- UX: what surface changes? Wireframe references included?
- performance: complexity hotspots? p95 risk?
The gauntlet produces three classes of findings:
- Blockers — must be addressed before execution starts (loops back to planner)
- Change requests — should be addressed in the plan (planner updates, no loop back if minor)
- Notes: context the specialists will need (recorded in decisions/ directory)
Why it's load-bearing: A plan that hasn't been stress-tested adversarially WILL have gaps. The gauntlet finds them at planning time — when fixing is cheap. Without it, gaps are found mid-execution by specialists who have to either work around them (introducing tech debt) or escalate (blocking the harness). Either way, the cost is 5-10x what gauntlet would have cost.
The gauntlet is also where adversarial domain knowledge gets surfaced. The planner is one specialist's view. The gauntlet is the cross-specialist sanity check. For complex harnesses (schema + auth + multi-tenant), this is the difference between a clean ship and an OUTCOMES-001-class incident.
Loop-back rule: If gauntlet finds blockers, the plan goes back to harness-planner for revision. The gauntlet runs again on the revised plan. No execution until the gauntlet passes. This is the gate.
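The gate logic can be sketched as a small decision function. The name `gauntlet_gate` and its output labels are hypothetical; the "rescope" branch borrows the guidance from the common questions on this page that three or more rejections point to a scope problem.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide what happens after a gauntlet round.
# Usage: gauntlet_gate <blocker_count> [prior_rejections]
gauntlet_gate() {
  local blockers="$1" prior_rejections="${2:-0}"
  if (( blockers == 0 )); then
    echo "proceed"          # gate passed, execution may start
  elif (( prior_rejections + 1 >= 3 )); then
    echo "rescope"          # third rejection: scope is probably wrong
  else
    echo "revise-plan"      # loop back to the planner, then re-run gauntlet
  fi
}

gauntlet_gate 0      # prints "proceed"
gauntlet_gate 2 0    # prints "revise-plan"
gauntlet_gate 1 2    # prints "rescope"
```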
Skill 4 — harness-coordinator
When to invoke: At the start of every session where you're running the coordinator role. Phases 2-N.
Load: Skill({ skill: "harness-coordinator" })
Critical rule: The coordinator skill is loaded by the COORDINATOR session — the orchestrator. A coordinator NEVER writes code. It dispatches specialists via the Task tool. This separation is structural.
What it does:
- Reads progress.json at session start, finds the next pending feature
- Reads claude-progress.txt for cross-session context
- Spawns a per-agent git worktree for the specialist
- Dispatches the specialist via Task tool with a precise brief
- Monitors the specialist's progress (via tool result)
- Merges the specialist's PR back via git merge --no-ff
- Updates progress.json: feature status → completed, records the git commit SHA
- Appends 2 lines to claude-progress.txt: what shipped, who shipped it
- Loops to the next pending feature
- At end of session: writes a session-end summary
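The progress.json update step in the list above can be sketched with jq. This is a minimal illustration, not the coordinator skill's actual mechanism: the function name `mark_feature_completed` is hypothetical, and it assumes the phases[] / features[] layout with `status` and `git_commit` fields described for the state file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mark one feature completed and record its commit SHA.
# Usage: mark_feature_completed <progress-file> <feature-id> <commit-sha>
mark_feature_completed() {
  local progress_file="$1" feature_id="$2" sha="$3" tmp
  tmp="$(mktemp)"
  # Update only the feature whose id matches; everything else is unchanged.
  jq --arg id "$feature_id" --arg sha "$sha" \
    '(.phases[].features[] | select(.id == $id))
       |= (.status = "completed" | .git_commit = $sha)' \
    "$progress_file" > "$tmp" && mv "$tmp" "$progress_file"
}
```

Writing to a temporary file and moving it into place avoids truncating the state file if jq fails midway.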
Why it's load-bearing: Without the coordinator role being structurally distinct from the implementor role, the coordinator session will inevitably try to "just fix this one thing" or "just verify this output." The context window blows up. The coordinator loses track of the big picture. Features get partially shipped. The progress.json gets stale.
The separation IS the discipline. The coordinator can hold the project-level state because it's not also holding implementation-level state. The specialist can produce high-quality work because it has a narrow scope. This is the same architectural principle as microservices — bounded contexts, explicit interfaces, no shared state.
The "coordinator never implements" rule has been violated exactly enough times in PM33 history to know it's a real failure mode. Every violation traced to a slippery slope: "this is just a one-line config tweak" → "while I'm here let me also fix the test" → "actually let me write the whole feature, it's only 30 minutes" → 3 hours later the coordinator session has shipped buggy work and the progress.json is wrong.
Skill 5 — harness-discipline
When to invoke: By each specialist agent at the start of its execution. Loaded inside Task tool dispatches.
Load: Skill({ skill: "harness-discipline" })
Critical rule: This skill is for SPECIALISTS, not coordinators. Loading it in a coordinator session would conflate the roles.
What it does: Enforces the TDD execution discipline per feature:
- RED phase — the specialist MUST write the failing test FIRST. Confirm it fails for the right reason. No implementation code yet.
- GREEN phase — minimum implementation to pass the test. No extra features. No speculative refactors.
- REFACTOR phase — clean up, ensure no regression in adjacent tests. Decisions documented (e.g., cache vs DB query — why?).
- DELIVERY phase: npm run type-check, npm run test:locked, npm run lint, npm run cleanup:node-processes, git push, gh pr create.
The skill also enforces:
- Per-agent GIT_INDEX_FILE activation before any git operation
- Per-agent worktree usage when running parallel with other specialists
- Explicit-paths-only for git add (no git add -A)
- Progress.json updates on completion
- Reporting back to the coordinator with evidence (PR URL, commit SHA, test count)
Why it's load-bearing: Without TDD discipline, specialists rationalize tests to pass after writing the implementation. Mocks expand to hide real bugs. The OUTCOMES-001 class of bug — code that passed all mocked tests but failed in production on real data — comes from skipping RED.
Also without per-agent git protections, you get ABSORPTION-002 — cross-agent commit absorption. The discipline skill activates the per-agent index BEFORE the specialist's first git add, which is the only point at which absorption can be prevented.
How the skills compose
prep (3 sub-skills) → docs/dogfood/discovery/<slug>.md
docs/dogfood/discovery/<slug>.md → feeds harness-planner (Phases 1-3)
plan draft → feeds gauntlet-review (Phase 4)
gauntlet findings → integrated into plan Appendix A
plan + Appendix A → unlocks coordinator
coordinator dispatches → specialists load discipline
specialists ship → coordinator updates progress.json
The composition is one-way. Each downstream skill assumes the upstream skill has run. The coordinator doesn't validate that gauntlet ran (it can't — gauntlet output is in decisions/ files, not enforced state). But practically, if you skip gauntlet, the harness will hit a blocker mid-execution, and the coordinator will surface it as a failed Task. The system fails gracefully, not silently.
Related skill — pm33-mcp (conventions for PM33 MCP tool usage)
In addition to the 5 harness skills above, PM33 ships a pm33-mcp skill that captures conventions for working with PM33's MCP tools (mcp__pm33-staging__*). This is a Claude Code skill, not a harness phase — it's invoked whenever an agent is about to call PM33 MCP tools.
Load: Skill({ skill: "pm33-mcp" }) (invoke BEFORE the first pm33_* tool call in a session)
What it covers:
- Tool routing — which MCP tool to use for which goal (work item creation, Brief creation, querying backlog, linking objectives, running prioritization)
- Known gaps: the 9 documented gaps in PM33-CREATE-PATH-GAPS-001 (e.g., description field capped at 5000 chars, score_alignment returns keyword similarity rather than objective binding, MCP intermittent disconnect patterns) with workarounds for each
- Batch parallel patterns: when to fire 4-8 pm33_create_work_item calls in a single message vs. sequence them
- Canonical parent UUIDs: known epic UUIDs for tech-debt audit, harness work, Pam infrastructure, etc.
- MCP instability handling — the queue-and-execute pattern for when MCP drops mid-call (queue pending calls in a memory file, continue MCP-independent work, drain queue on reconnect)
- Field-encoding conventions — markdown-to-PM33 mappings (P0/P1/P2/P3 → critical/high/medium/low; "30 min"/"few hours"/"1 day" → 1/2/5 story points)
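The field-encoding mappings in the last bullet can be sketched as two lookup functions. The function names are hypothetical and the error handling is an assumption; only the mappings themselves come from the list above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the markdown-to-PM33 field mappings.

# Usage: priority_to_pm33 <P0|P1|P2|P3>
priority_to_pm33() {
  case "$1" in
    P0) echo "critical" ;;
    P1) echo "high" ;;
    P2) echo "medium" ;;
    P3) echo "low" ;;
    *)  echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

# Usage: effort_to_points <"30 min"|"few hours"|"1 day">
effort_to_points() {
  case "$1" in
    "30 min")    echo 1 ;;
    "few hours") echo 2 ;;
    "1 day")     echo 5 ;;
    *) echo "unknown effort: $1" >&2; return 1 ;;
  esac
}

priority_to_pm33 P1            # prints "high"
effort_to_points "few hours"   # prints 2
```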
Why it's a separate skill: PM33 MCP tools have specific gotchas that don't apply to other MCP servers. Encoding these in a skill (rather than CLAUDE.md) means the conventions are loaded only when relevant, not in every session.
The skill is itself an example of the Anthropic guidance — "Load specialized expertise on-demand without bloating every session" (see module 1's Harness Ecosystem section). The pm33-mcp skill is ~5,000 chars; loading it in every session would waste context. Loading it only when about to file work items or query backlog is the right tradeoff.
If you're building your own AI product development platform with MCP tools, expect to ship a similar conventions skill specific to your MCP server.
Common questions
Q: Can a small project skip discovery? A: If the project is < 16 hours, it shouldn't be a harness at all (use a single Brief). For harness-tier work, ALWAYS run discovery. The 30 minutes of overhead amortizes well against the 5+ hours of mid-stream rework that skipping discovery causes.
Q: When does gauntlet need to run again mid-execution? A: If the plan materially changes (new phases added, scope expanded, dependencies revised). For a small in-scope adjustment, the coordinator can decide it doesn't warrant gauntlet. For "we discovered we also need to migrate X" — yes, re-gauntlet.
Q: Can one session play coordinator AND implementor? A: No. The structural separation is the discipline. If you find yourself wanting to "just implement this one thing" from the coordinator session, that's the signal that you've lost the coordinator role. Dispatch a specialist.
Q: Do specialists communicate with each other? A: Never directly. The coordinator is the only one holding cross-specialist context. If two specialists' work needs to compose, the coordinator schedules them sequentially, or the planner architects the work so they don't depend on each other.
Q: What if gauntlet rejects the plan repeatedly? A: That's the system working correctly. Loop back to planner. If gauntlet rejects 3+ times, the underlying scope is wrong — probably needs to be split into multiple smaller harnesses or the team needs to make architectural decisions first.
Workflow narrative — Spinning up "PAM-CONTEXT-BUDGET-001"
Meet a real harness: PAM-CONTEXT-BUDGET-001. The Pam orchestrator's system prompt was overshooting its context budget — symptoms were intermittent silent truncation, occasional 500s on long sessions. A team estimated 18-24 hours of work to investigate, design a fix, and ship across 4-5 phases.
Here's how the harness gets set up.
Step 1 — Pick a HARNESS-ID
The convention is {DOMAIN}-{PROBLEM}-{SEQUENCE}. For this one: PAM-CONTEXT-BUDGET-001. Short, greppable, won't collide with future harnesses.
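A quick format check for the convention might look like the sketch below. The pattern is an assumption inferred from the single example: uppercase segments separated by hyphens, ending in a three-digit sequence. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the {DOMAIN}-{PROBLEM}-{SEQUENCE} convention.

# Returns 0 when the ID looks like DOMAIN-PROBLEM[-MORE]-NNN.
is_valid_harness_id() {
  local id="$1"
  local pattern='^[[:upper:]][[:upper:][:digit:]]*(-[[:upper:][:digit:]]+)+-[[:digit:]]{3}$'
  [[ "$id" =~ $pattern ]]
}

# Prints the state directory used for a given harness ID.
harness_dir() {
  echo "docs/frameworks/agent-state-$1"
}

is_valid_harness_id "PAM-CONTEXT-BUDGET-001" && harness_dir "PAM-CONTEXT-BUDGET-001"
# prints "docs/frameworks/agent-state-PAM-CONTEXT-BUDGET-001"
```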
Step 2 — Create the scaffold
HARNESS_ID="PAM-CONTEXT-BUDGET-001"
mkdir -p "docs/frameworks/agent-state-${HARNESS_ID}"
cd "docs/frameworks/agent-state-${HARNESS_ID}"
Step 3 — Write the README
The README is the harness's constitution. It includes:
- Problem statement (what's broken, why now)
- Phase breakdown (3-7 phases, each ~3-6 hours)
- Success criteria (objective, measurable)
- Risks + mitigations
- Out-of-scope (what you'll explicitly NOT do)
Skip the temptation to write a 30-page design doc. The README is a navigational tool, not a spec. Spec lives in individual Briefs.
Step 4 — Bootstrap progress.json
This is the canonical state. Example structure:
{
  "harness_id": "PAM-CONTEXT-BUDGET-001",
  "created_at": "2026-05-27T09:00:00Z",
  "phases": [
    {
      "id": 1,
      "name": "Instrument context-budget measurement",
      "status": "in_progress",
      "features": [
        { "id": "F1.1", "name": "Add tokenCounter middleware to PamOrchestrator", "status": "completed", "git_commit": "abc1234" },
        { "id": "F1.2", "name": "Emit budget_exceeded structured log event", "status": "in_progress" }
      ]
    },
    {
      "id": 2,
      "name": "Identify top 5 budget consumers",
      "status": "pending",
      "features": [...]
    }
  ]
}
Every feature has a status (pending | in_progress | completed | blocked) and a git commit SHA when completed. The coordinator queries this file at every session start.
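Picking the next feature from this file can be sketched with jq, which the init.sh example already relies on. The function name `next_pending_feature` is hypothetical, and the selection rule (first feature with status `pending`, in phase order) is an assumption about how "next" is defined.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the id of the first pending feature, scanning
# phases and features in file order. Prints nothing when none is pending.
# Usage: next_pending_feature <progress-file>
next_pending_feature() {
  local progress_file="$1"
  jq -r 'first(.phases[].features[]? | select(.status == "pending") | .id)' \
    "$progress_file"
}
```

The `[]?` form tolerates a phase that has no features array yet, so a partially filled plan does not abort the query.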
Step 5 — Write init.sh
A bash script that runs at every session start, validates the environment, and prints the current status. Example:
#!/usr/bin/env bash
set -euo pipefail
HARNESS_ID="PAM-CONTEXT-BUDGET-001"
cd "$(git rev-parse --show-toplevel)"
# Validate env
[ -d "docs/frameworks/agent-state-${HARNESS_ID}" ] || { echo "Harness dir missing"; exit 1; }
[ -f "docs/frameworks/agent-state-${HARNESS_ID}/pm33-agent-progress.json" ] || { echo "progress.json missing"; exit 1; }
# Print current state
jq '.phases[] | select(.status != "completed")' \
"docs/frameworks/agent-state-${HARNESS_ID}/pm33-agent-progress.json"
# DB health (if harness touches DB)
npm run db:validate --silent
echo "Harness ${HARNESS_ID} ready."
Step 6 — Activate per-agent git index for the coordinator session
export CLAUDE_AGENT_ID="${HARNESS_ID}-coord"
eval "$(./scripts/git/agent-init.sh)"
This sets GIT_INDEX_FILE to a unique per-agent path. Now git add in this shell can't absorb other sessions' files. The bin/git wrapper additionally refuses stash, dirty checkout, reset --hard, clean -f — the dangerous operations that caused historical incidents.
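To make the `eval "$(...)"` pattern concrete, here is a sketch of the kind of output such an init script could emit. This is not the contents of scripts/git/agent-init.sh: the function name `agent_index_exports` and the `index.<agent-id>` path layout are assumptions. GIT_INDEX_FILE itself is a standard git environment variable.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print an export line that points git at a
# per-agent index file. The caller applies it with eval.
# Usage: agent_index_exports <agent-id> [git-dir]
agent_index_exports() {
  local agent_id="$1" git_dir="${2:-.git}"
  # %q shell-quotes the path so the emitted line is safe to eval.
  printf 'export GIT_INDEX_FILE=%q\n' "${git_dir}/index.${agent_id}"
}

agent_index_exports "PAM-CONTEXT-BUDGET-001-coord"
```

A real script would also need to seed the new index from HEAD (for example with `git read-tree HEAD`), because git treats a missing index file as empty. This sketch omits that step.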
Step 7 — Start dispatching specialists
The coordinator picks the next pending feature, spawns a worktree for the specialist, dispatches via Task tool:
SPECIALIST_ID="${HARNESS_ID}-F1.2-haiku"
bash scripts/git/spawn-agent-worktree.sh "${SPECIALIST_ID}"
# returns: /path/to/.claude/worktrees/agent-${SPECIALIST_ID}
Then the Task tool dispatch (in the coordinator's Claude Code session):
Task({
subagent_type: "backend-architect",
model: "haiku",
description: "F1.2: Emit budget_exceeded structured log",
prompt: `
**Working directory**: /path/to/.claude/worktrees/agent-${SPECIALIST_ID}
**Branch**: worktree-agent-${SPECIALIST_ID}
**Agent ID**: ${SPECIALIST_ID}
Run before any git or file op:
cd <working directory>
export CLAUDE_AGENT_ID=${SPECIALIST_ID}
eval "$(./scripts/git/agent-init.sh)"
Load Skill({ skill: "harness-discipline" }).
[feature spec + AC + TDD plan]
When done: commit + push, return PR URL + commit SHA + test count.
`
})
The specialist runs the TDD cycle, commits to its own worktree branch, pushes, opens a PR. The coordinator monitors via tool result, then merges via gh pr merge --squash.
Step 8 — Repeat for each feature
The coordinator's loop:
- Read progress.json, find next pending feature
- Spawn worktree, dispatch specialist
- Wait for specialist to complete
- Verify the PR + git log
- Update progress.json: feature status → completed, record commit SHA
- Append a 2-line entry to claude-progress.txt: what shipped, who shipped it
- Loop
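The two-line log entry from the loop above could be appended with a helper like this. The entry format (UTC timestamp, feature, commit, then the agent on a second line) is an assumption, as is the function name `log_shipped`; the source only specifies "what shipped, who shipped it".

```bash
#!/usr/bin/env bash
# Hypothetical helper: append a 2-line entry to the append-only session log.
# Usage: log_shipped <log-file> <feature-id> <agent-id> <commit-sha>
log_shipped() {
  local log_file="$1" feature_id="$2" agent_id="$3" sha="$4"
  {
    printf '%s shipped: %s (commit %s)\n' \
      "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$feature_id" "$sha"
    printf '  by: %s\n' "$agent_id"
  } >> "$log_file"   # >> appends, so earlier entries are never rewritten
}
```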
Step 9 — End of harness
When all features in progress.json are completed:
- Run delivery validation
- Open the final wrap-up PR if there's a synthesis step
- Update TECHNICAL_DEBT.md if any new tech debt was filed during the harness
- Append the final session entry to claude-progress.txt
- File a PM33 work item with status done for the harness itself
The full set of guard rails
The harness wouldn't work without these structural protections, learned from incident history:
| Protection | What it prevents | Where it's enforced |
|---|---|---|
| Per-agent git index (GIT_INDEX_FILE) | Cross-agent commit absorption | scripts/git/agent-init.sh + bin/git wrapper |
| Per-agent worktree | Working tree contamination | scripts/git/spawn-agent-worktree.sh |
| bin/git wrapper refuses stash/reset --hard/clean -f | Lost work | bin/git in PATH |
| Pre-commit hook: tree-shrink guard | Empty-index commits that nuke files | .husky/pre-commit |
| Pre-commit hook: deletion delta check | Mass deletion accidents | .husky/pre-commit |
| harness-discipline skill: TDD enforcement | Skipping RED phase | Skill checklist |
| progress.json as source of truth | "Who's working on what" confusion | Coordinator reads at every step |
Each one of those exists because of a real incident. The harness documentation includes pointers to TECHNICAL_DEBT.md entries describing each. New harnesses inherit all these protections — you don't have to set them up.
Common failure modes + recovery
"Absorption — my commit picked up another session's files"
Symptom: git show --stat HEAD shows files you didn't intend to stage. Common when you forgot to activate per-agent index in a long-running coordinator shell.
Recovery:
git reset --soft HEAD~1 # undo the commit, keep changes
git restore --staged path/to/absorbed/file # unstage the foreign file
git commit -m "fix: <original message>" # recommit with only your files
Then activate per-agent index before continuing:
export CLAUDE_AGENT_ID="<your-id>"
eval "$(./scripts/git/agent-init.sh)"
"Dirty worktree blocks rebase / merge"
Symptom: git pull --rebase fails with "you have unstaged changes." Usually means another concurrent session has uncommitted work in this worktree.
Recovery: do NOT stash (banned for good reason — pop conflicts lose work). Either:
- Wait for the other session to commit
- Create a fresh worktree off origin/main and cherry-pick your commit there
- Open a PR from your current branch, even though it's mixed; PRs sort out the conflict at merge time
"Specialist finished but PR shows no files"
Symptom: PR opens, gh pr view shows fewer files than expected. The rebase agent or the index management dropped files silently.
Recovery: this is the PR #55 + #56 hotfix pattern. Verify files via gh pr view --json files | jq '.files | length'. If files are missing:
- Find the files in the worktree (they're usually still on disk)
- Spawn a fresh worktree off the latest main
- Copy the missing files in
- Commit + push + open a hotfix PR
The general lesson: always audit git show --stat HEAD and gh pr view --json files after commits + before merges. The Excalidraw/PR sanity check is cheap; the absorption recovery is expensive.
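That audit can be scripted as a comparison between the paths you committed and the paths the PR reports. The sketch below is an illustration under assumptions: the helper name missing_paths and the file names committed.txt and in-pr.txt are hypothetical, and the two lists are stubbed so the example runs without a repository.

```shell
#!/usr/bin/env bash
set -eu

# Print every path listed in file $1 (expected) that is absent from file $2 (actual).
missing_paths() {
  local expected actual
  expected="$(mktemp)"
  actual="$(mktemp)"
  sort -u "$1" > "$expected"
  sort -u "$2" > "$actual"
  comm -23 "$expected" "$actual"   # lines unique to the expected list
  rm -f "$expected" "$actual"
}

# In a real session the two lists could come from:
#   git show --name-only --format= HEAD           > committed.txt
#   gh pr view --json files --jq '.files[].path'  > in-pr.txt
# Here they are stubbed with made-up paths.
committed="$(mktemp)"
in_pr="$(mktemp)"
printf '%s\n' src/a.ts src/b.ts docs/c.md > "$committed"
printf '%s\n' src/a.ts docs/c.md > "$in_pr"

missing="$(missing_paths "$committed" "$in_pr")"
if [ -n "$missing" ]; then
  echo "missing from PR:"
  echo "$missing"
fi
```

For a PR with several commits, the expected list would need to cover the whole branch rather than HEAD alone.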
"Coordinator session got confused about which feature to dispatch next"
Symptom: you (or the coordinator) re-dispatch a feature that's already in progress, or skip one that's pending.
Recovery: progress.json is the source of truth. Always re-read it before deciding. If two coordinators are running concurrently (rare but possible), the coordinator mutex (scripts/git/coordinator-mutex.sh) prevents the worst races. Acquire at session start, release at end.
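The acquire-at-start, release-at-end pattern can be illustrated with a directory-based lock. This is a generic sketch of the technique and an assumption about how such a mutex might work, not the contents of scripts/git/coordinator-mutex.sh; the LOCK_DIR location and the function names acquire_mutex and release_mutex are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical lock location; override LOCK_DIR to place it elsewhere.
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/coordinator-demo.lock}"

# mkdir is atomic: exactly one caller can create the directory.
acquire_mutex() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "$$" > "$LOCK_DIR/pid"
    return 0
  fi
  echo "lock held by PID $(cat "$LOCK_DIR/pid" 2>/dev/null || echo unknown)" >&2
  return 1
}

# Remove the lock so the next coordinator session can acquire it.
release_mutex() {
  rm -rf "$LOCK_DIR"
}

# Typical use at the top of a coordinator session:
#   acquire_mutex || exit 1
#   trap release_mutex EXIT
```

A lock like this goes stale if a session is killed before its exit trap runs, so the recorded PID is there to help a human decide whether the lock is safe to remove.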
Hands-on (15 minutes)
Spin up a tiny test harness:
HARNESS_ID="DEMO-CURRICULUM-001"
mkdir -p docs/frameworks/agent-state-${HARNESS_ID}
cd docs/frameworks/agent-state-${HARNESS_ID}
# Minimal README
cat > README.md << 'EOF'
# Demo Harness — Curriculum Module 3 Walkthrough
## Problem
Curriculum reader wants to feel the shape of a harness.
## Phases
1. Walk through the structure
2. Read progress.json
3. (Pretend to) dispatch a specialist
## Success criteria
- Reader can find the next pending feature
- Reader understands the coordinator/specialist split
## Out of scope
- Actually shipping any code
EOF
# Minimal progress.json
cat > pm33-agent-progress.json << 'EOF'
{
  "harness_id": "DEMO-CURRICULUM-001",
  "created_at": "2026-05-27T00:00:00Z",
  "phases": [
    {
      "id": 1, "name": "Walkthrough",
      "status": "in_progress",
      "features": [
        { "id": "F1.1", "name": "Read this README", "status": "completed" },
        { "id": "F1.2", "name": "Read progress.json", "status": "in_progress" },
        { "id": "F1.3", "name": "Read the coordinator pattern in Module 3", "status": "pending" }
      ]
    }
  ]
}
EOF
# Read what the coordinator would see
jq '.phases[] | select(.status != "completed") | .features[] | select(.status == "pending")' pm33-agent-progress.json
# Cleanup when done
cd ../../..
rm -rf docs/frameworks/agent-state-${HARNESS_ID}
If the jq query returns the F1.3 feature, you've successfully read what the coordinator reads at every step.
Further reading
Internal
- /docs/frameworks/LONG_RUNNING_AGENT_FRAMEWORK.md: the framework spec
- /docs/frameworks/HARNESS_PROJECT_TEMPLATE.md: copy-paste template
- /docs/frameworks/agent-state-METRIC-ALIGN-001/: a working production example (11 phases, 55 features)
- /docs/development/CONCURRENT_AGENT_GIT_MODEL.md: the git protections in depth
- harness-coordinator skill: load with Skill({ skill: "harness-coordinator" }) when running a coordinator session
- harness-discipline skill: loaded by specialists for TDD enforcement
- Module 4: Outcome Attribution, covering what happens after the harness ships
Industry frame
Harness engineering isn't a PM33 invention. The vocabulary and the practice are being codified across the industry in parallel. Two pieces in particular are worth the read — they shaped the language we use in this module.
- Agent Harness Engineering, Addy Osmani (2026). Origin of the "Agent = Model + Harness" framing used above. Coins the ratchet principle: every line in your AGENTS.md should trace to a specific past failure. Argues that "a decent model with a great harness beats a great model with a bad harness": most agent failures concentrate in the harness, not the model. The anatomy diagram in the previous section is a recreation of Addy's concentric-layers framing, adapted to PM33's specific ring choices.
- Harness Engineering, Martin Fowler. The cybernetic-control framing of the harness: a governor regulating codebases toward desired states. Distinguishes feed-forward controls (guides: AGENTS.md, skills, planning) from feedback controls (sensors: tests, linters, AI review). Argues both are required: "Separately, you get either an agent that keeps repeating the same mistakes (feedback-only) or an agent that encodes rules but never finds out whether they worked (feed-forward-only)." Also makes the computational-vs-inferential controls distinction that PM33's gauntlet review (inferential) + machine-verifiable AC (computational) split is built on.
Read these if you're designing your own harness skills — both authors crystallize the conceptual model better than this module can.