Setting Up Your First Harness — Builder Track (Module 3)

In this module

Install the harness — one curl line, 30 seconds (callout above)
Align on what to build — a five-minute conversation with Claude, no formality
Run five slash commands in order — /harness-prep → /harness-discovery → /harness-plan → /gauntlet-review → /harness-coordinator
Close out — /goal Complete Harness and e2e testing

That's the whole flow. The rest of this page is appendix material for the curious — what each skill does internally, a worked example, failure modes we've actually hit, and the inner-mechanics diagram. Skip unless you're building your own harness skills or debugging one that's stuck.

What's actually in a harness

A working definition shared across the industry: Agent = Model + Harness. The model is what you'd download from a vendor; the harness is everything around it — the code, configuration, tools, state, and feedback loops that turn a chat completion into a useful agent.

Each ring below encodes an assumption about what the model alone can't reliably do. As models improve, the assumptions shift — but the rings don't go away.

[Figure: Anatomy of an agent harness. After Osmani's harness-anatomy framing (2026). See Further reading.]

PM33's harness implementation makes specific choices at each ring. Context comes from CLAUDE.md + skill files + per-agent indices. Control flow is the five slash commands listed above. Action is the standard tool set plus PM33 MCP. Persistence is the per-agent worktree convention + progress.json + audit log. Observation is the gauntlet review + machine-verifiable acceptance criteria + outcome attribution.

Learning objectives

After this module you should be able to:

Decide when a harness is the right tool (16+ hours, multi-session, 3+ phases)
Install the harness in 30 seconds
Take a feature from "idea" to "shipped" using five slash commands
Verify and close out a finished harness with /goal

When to use a harness

A harness is overkill for a single bug fix. It's exactly the right tool for a multi-phase feature, an AI integration, a security overhaul, or any project that will run across multiple Claude Code sessions and multiple specialists. Decision rule:

Use a harness	Don't bother
16+ hours of work	<8-hour single-session feature
3+ logical phases	One-off bug fix
Multiple specialties (backend + frontend + DB)	Doc-only changes
Architectural decisions worth recording	Unclear requirements (do brainstorming first)
High blast radius (schema, auth, core services)	Throwaway prototype

When in doubt: skip the harness on the first session. If you find yourself losing context between sessions or struggling to coordinate multiple specialists, spin one up — the overhead amortizes quickly for multi-week work.
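The decision rule can be read as a small triage function. The sketch below is one possible encoding, not part of the harness tooling: the name `harness_triage` is hypothetical, and treating "16+ hours but fewer than 3 phases" as a judgment call is an assumption based on the "when in doubt" advice above.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper: one reading of the decision table.
# Arguments: estimated hours of work, number of logical phases.
harness_triage() {
  local hours="$1" phases="$2"
  if (( hours < 16 )); then
    # Below harness tier: handle as a single Brief.
    echo "single-brief"
  elif (( phases >= 3 )); then
    echo "harness"
  else
    # Assumption: large but not clearly multi-phase work is a judgment call.
    echo "judgment-call"
  fi
}

harness_triage 20 4   # prints: harness
harness_triage 4 1    # prints: single-brief
harness_triage 18 2   # prints: judgment-call
```

The other table rows (specialties, blast radius, ADR-worthy decisions) are qualitative and are left to the person making the call.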

Step 1 — Align on what to build

Before any slash command, have a normal conversation with Claude in chat. Describe:

The feature — what you're trying to build
The constraint — schedule, scope, dependencies
The outcome — how you'll know it worked (a metric, a behavior, a customer signal)

No template required. Brief Claude the way you'd brief a senior PM friend over coffee. Five minutes is plenty. The slash commands take it from there.

Step 2 — Run the five slash commands

Each picks up where the last left off. You can stop between any two and resume later — the harness state persists.

#	Command	What it does	Time
1	`/harness-prep`	Phase 0 orchestrator. Sequences discovery (internal codebase audit) + brainstorming (2-4 framings) + conditional research (1+3 search budget) into one enriched discovery doc	5-15 min
2	`/harness-discovery`	Invoked inside `/harness-prep` — broken out here for transparency. Audits existing code, names what's reusable, calls out load-bearing pre-existing bugs	(within step 1)
3	`/harness-plan`	Decomposes the prepared doc into phased Briefs with TDD plans, agent assignments, and acceptance criteria. Files the harness skeleton under `docs/frameworks/agent-state-{HARNESS-ID}/`	10-20 min
4	`/gauntlet-review`	Dispatches 3 specialist reviewers (security-auditor + backend-architect + test-automator, by default) in parallel against the spec. Surfaces blockers + integrates findings to Appendix A	10-15 min
5	`/harness-coordinator`	Execute. Orchestrates specialists, dispatches them to per-agent worktrees, monitors progress, handles failure modes. Prompts you only when human judgment is needed	hours to days, autonomous

You don't need to know what's happening inside any of these. The coordinator prompts you when human judgment is required (approve a tradeoff, pick between two paths, resolve a conflict). Otherwise it runs autonomously across multiple sessions.

Step 3 — Close out

When the coordinator reports the harness complete, run:

/goal Complete Harness and e2e testing

This verifies every phase landed cleanly, runs the full e2e suite, files any residual tech-debt items as PM33 work items via MCP, and marks the harness "delivered" — closing the audit-log loop. The harness directory stays in docs/frameworks/agent-state-{HARNESS-ID}/ as the permanent record.

That's the entire workflow: one opening conversation, five commands, and one closing command.

Deeper dive — for the curious

The rest of this page explains what each skill does internally, walks through a real example (PAM-CONTEXT-BUDGET-001), names the failure modes we've actually hit, and shows the inner-mechanics diagram. Skip unless you want to build your own harness skills or are debugging one that's stuck.

The directory layout

Each harness lives at docs/frameworks/agent-state-{HARNESS-ID}/:

docs/frameworks/agent-state-{HARNESS-ID}/
├── README.md                    ← project summary, phase breakdown, success criteria
├── init.sh                      ← env validation (run at session start)
├── pm33-agent-progress.json     ← feature list with status (the source of truth)
├── claude-progress.txt          ← append-only session log
├── decisions/                   ← ADRs for architectural choices made mid-harness
└── issues/                      ← incident notes for things discovered mid-implementation

/harness-plan creates this structure for you. You almost never edit these files by hand — the slash commands maintain them.

The coordinator + specialists pattern

The harness separates two roles structurally:

Coordinator — orchestrates, never implements. Reads progress.json, picks the next feature, dispatches a specialist agent in a per-agent worktree, merges back when done. The coordinator is YOU running the /harness-coordinator skill (or a Claude Code session that's been told "you're the coordinator").
Specialists — implement, never coordinate. Each one gets a fresh worktree, a clear Brief, and the harness-discipline skill. Specialists are dispatched via the Task tool, run RED → GREEN → REFACTOR → DELIVERY, commit + push from their own worktree, report back with evidence.

Critical: specialists never talk to each other. They each get a Brief, a worktree, and a clear scope. The coordinator is the only thing that holds the cross-specialist context. This is the structural fix for the ABSORPTION class of bugs (where one agent's commit accidentally absorbs another's work).

The harness skills workflow (and why each is load-bearing)

Companion diagram: ../diagrams/03b-harness-skills-workflow.excalidraw

PM33's harness infrastructure is a workflow, not a tool. The workflow is implemented as 5 distinct Claude Code skills, each invoked at a specific moment:

Phase 0                  Phases 1-3       Phase 4              Execute
─────────                ──────────       ─────────            ────────────────────
harness-prep        →    planner     →    gauntlet-review →    coordinator
  ├─ discovery (always)                   (integrates into          ↓
  ├─ brainstorming (always)                Appendix A)         discipline
  └─ research (conditional)                                    (per specialist)

Each skill has a specific job. Skipping any one of them is how harnesses fail. They are NOT replacements for each other — they compose. The system is designed assuming all five run.

Recent change (2026-05-29): harness-prep is the new Phase 0 entry. Previously you'd load harness-discovery directly. Now you load harness-prep, which orchestrates discovery + a brainstorming pass + a conditional research pass. The output is an enriched discovery doc that gives the planner a much better starting point.

Why harnesses matter (the value prop)

Before going skill-by-skill: why is this elaborate workflow worth it?

PM33 was the first product team to systematically use multi-agent AI development at scale (5+ parallel specialists, 100+ shipped Briefs per quarter). We learned the hard way that AI agents fail in different ways than humans do:

Human-only software dev has a natural rate-limiter — humans get tired, take breaks, hand off context. The system tolerates ad-hoc workflows.
AI-driven software dev is a fire hose. Agents work in parallel at 50ms granularity. The same patterns that worked for 1-3 humans break down at 5+ agents. Cross-agent commit absorption (ABSORPTION-002), mocked-out integration tests that hide real bugs (OUTCOMES-001), context drift across multi-session work, specification ambiguity exploding into 10x rework — all of these are AI-era failure modes.

The harness skills exist because we hit each of these failure modes in production and structurally fixed them. Each skill is a checkpoint that catches a specific class of failure BEFORE it cascades.

The cost of running the full workflow on a 20-hour project: ~30 minutes (discovery + planning + gauntlet upfront, coordinator overhead during execution). The cost of NOT running it: 5-15 hours of mid-stream rework, plus the failure modes documented in TECHNICAL_DEBT.md.

The math gets better the larger the project. For a 100-hour cross-team initiative, the harness workflow saves multiples of its overhead.

Skill 1 — `harness-prep` (Phase 0 entry point)

When to invoke: At the START of every new harness, before drafting the plan. Phase 0.

Load: Skill({ skill: "harness-prep" })

What it does: Orchestrates three sub-skills to produce an enriched discovery doc that the planner consumes:

harness-prep
  ├─ invokes harness-discovery        (always — internal infra audit)
  ├─ invokes superpowers:brainstorming (always — 2-4 problem framings)
  └─ invokes harness-research         (conditional — strict 1+3 budget)
       ↓ produces enriched discovery doc with up to 3 sections

Sub-skill: `harness-discovery` (always invoked)

Internal audit of the existing codebase. Surfaces:

Existing services the planned work can reuse (vs. reinvent)
Schema patterns the new code should follow (vs. fight)
MCP tools already available (vs. building new ones)
Tech debt that overlaps with the planned scope (so the harness can address it deliberately or carve it out of scope)
Active harnesses that touch adjacent code (to coordinate timing)
Related ADRs already decided (to avoid relitigating)

Sub-skill: `superpowers:brainstorming` (always invoked)

Generates 2-4 alternative framings of the problem before committing to one. Prevents the "first-idea lock-in" failure mode where the planner anchors on the requester's initial framing without considering whether a different framing would dissolve the problem entirely. Output: framing alternatives ranked by trade-off.

Sub-skill: `harness-research` (conditionally invoked)

External research (web, papers, vendor docs) only when one of the trigger conditions fires:

Evaluating a new library/SDK not already in package.json
Feature touches a fast-moving domain: AI API patterns, security advisories, compliance (SOC2/GDPR/HIPAA), billing rules
User explicitly asks for outside-codebase context
Planner cannot write a credible spec without external API knowledge
Discovery flags a gap as "unclear without external context"

Strict budget: 1 search-specialist agent + 3 WebSearch calls (default). Hard ceiling: 2 agents + 6 WebSearch calls. Findings are decay-tagged (research date, sources, confidence) so they don't get cargo-culted into future plans without re-verification.

Default is no research. If you cannot name the specific scoped question you need answered, skip this step.
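The budget rule is simple enough to express as a check. This is a minimal sketch, assuming the caller tracks how many agents and WebSearch calls have been used; the function name `research_budget_status` is hypothetical, and only the numbers (1 + 3 default, 2 + 6 ceiling) come from the text above.

```shell
#!/usr/bin/env bash
# Budget numbers from the text: default 1 agent + 3 searches,
# hard ceiling 2 agents + 6 searches. The helper itself is hypothetical.
research_budget_status() {
  local agents="$1" searches="$2"
  if (( agents <= 1 && searches <= 3 )); then
    echo "within-default"
  elif (( agents <= 2 && searches <= 6 )); then
    echo "within-ceiling"
  else
    echo "over-budget"
  fi
}

research_budget_status 1 3   # prints: within-default
research_budget_status 2 5   # prints: within-ceiling
research_budget_status 1 7   # prints: over-budget
```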

Why prep is load-bearing

Without prep, every new harness plan gets drafted in a vacuum. Three failure modes:

Greenfield assumption — planner assumes capabilities must be built when they actually already exist. Six features in, the specialist discovers 80% of feature 3 already exists as a service. Hours of rework.
First-idea lock-in — planner commits to the requester's initial framing without considering alternatives. The "right" framing was a quick configuration change; the implemented framing was a 40-hour rewrite.
Missed domain knowledge — planner doesn't know what the field already learned. Reimplements a known anti-pattern. Gauntlet might catch it; might not.

Output artifact: docs/dogfood/discovery/<slug>.md with up to 3 named sections:

Section	Source	Always present?
`## Findings by area`	harness-discovery	Yes
`## Alternatives considered`	superpowers:brainstorming	Yes (unless skip-with-rationale)
`## External research`	harness-research	Conditional — only when trigger fires

This doc is the handoff artifact. harness-planner treats it as required input. A plan drafted without this doc is considered incomplete.

Example output: For a "Pam context budget" harness, prep produces:

Findings by area: existing TokenizerService can be reused; StructuredLogger already has PII redaction hooks; MCP-CONTEXT-BUDGET-001 was previously filed as tech debt — meaning this harness is now its resolution
Alternatives considered: 3 framings — "compress the prompt" vs. "selective context loading" vs. "structured summarization." Trade-offs ranked. Selective context loading wins on cost/risk.
External research (conditional): scoped to "recent Anthropic guidance on long-context handling," 2 sections added with decay tags

Planning starts informed, anchored on the right framing, with field knowledge factored in.

Critical anti-pattern: do NOT invoke harness-discovery, superpowers:brainstorming, or harness-research individually when prepping a harness. Invoke harness-prep instead — it handles the sequencing, ensures sections are appended in the right order, and produces a single coherent handoff doc.

Skip conditions for prep itself (document the skip in the plan's Appendix B with rationale):

Pure compliance / correctness fix with one correct answer (e.g., CVE remediation, schema drift fix)
Truly greenfield work (new file in a new directory with zero neighbors)
Re-running a prep that already completed in the same session

Skill 2 — `harness-planner`

When to invoke: After prep (Phase 0) has produced the discovery doc, when you're ready to scaffold the harness directory. Phase 1.

Load: Skill({ skill: "harness-planner" })

What it does: Decomposes 16+ hour work into phases and Briefs. Produces:

The docs/frameworks/agent-state-{HARNESS-ID}/ directory
README.md — problem statement, phase breakdown (3-7 phases), success criteria, out-of-scope (explicit)
pm33-agent-progress.json — the canonical state file with phases[] containing features[] with status + git_commit fields
init.sh — env validation script (run at every session start)
Specialist + LLM tier assignments per feature (using the harness-planner specialist matrix)
Required skills per Brief
Estimated effort per phase

Why it's load-bearing: Without a plan that lives in progress.json, there is no shared state across sessions. The coordinator can't know what's done. Specialists lose context across handoffs. The harness degenerates into ad-hoc work that any coordinator could approximate with find + git log + memory — which is to say, badly, and not reliably.

The plan is also the input to the next skill (gauntlet-review). No plan, no review.

Example output: For "Pam context budget" harness, planning produces a 4-phase, 16-feature plan with specialist assignments (4× backend-architect, 6× test-automator, 3× frontend-developer, 3× database-admin), tier assignments (8× sonnet, 6× haiku, 2× opus), and a progress.json that becomes the source of truth for the next 20 hours of work.

Skill 3 — `gauntlet-review`

When to invoke: After planning, BEFORE any Brief is dispatched to a specialist. Phase 1 gate.

Load: Skill({ skill: "gauntlet-review" })

What it does: Runs an adversarial multi-specialist critique of the plan. Each specialist class reads the plan from their domain perspective:

architect: is the scope right? Are the phases ordered correctly? Will the dependencies actually flow?
DBA: what's the schema impact? Migrations needed? Tenant isolation considered? Schema parity addressed?
security: what bypass paths does this open? Auth/scope changes? Audit log implications?
UX: what surface changes? Wireframe references included?
performance: complexity hotspots? p95 risk?

The gauntlet produces three classes of findings:

Blockers — must be addressed before execution starts (loops back to planner)
Change requests — should be addressed in the plan (planner updates, no loop back if minor)
Notes — context the specialists will need (recorded in decisions/ directory)

Why it's load-bearing: A plan that hasn't been stress-tested adversarially WILL have gaps. The gauntlet finds them at planning time — when fixing is cheap. Without it, gaps are found mid-execution by specialists who have to either work around them (introducing tech debt) or escalate (blocking the harness). Either way, the cost is 5-10x what gauntlet would have cost.

The gauntlet is also where adversarial domain knowledge gets surfaced. The planner is one specialist's view. The gauntlet is the cross-specialist sanity check. For complex harnesses (schema + auth + multi-tenant), this is the difference between a clean ship and an OUTCOMES-001-class incident.

Loop-back rule: If gauntlet finds blockers, the plan goes back to harness-planner for revision. The gauntlet runs again on the revised plan. No execution until the gauntlet passes. This is the gate.
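The gate can be sketched as a tiny decision function. This is an illustration under assumptions, not the skill's actual logic: `gauntlet_gate` is a hypothetical name, and the "rescope" outcome encodes the 3+ rejections guidance from Common questions.

```shell
#!/usr/bin/env bash
# Hypothetical gate. Arguments: blockers found by the latest gauntlet run,
# and how many times the plan has been rejected before this run.
gauntlet_gate() {
  local blockers="$1" prior_rejections="$2"
  if (( blockers == 0 )); then
    echo "pass"          # execution may start
  elif (( prior_rejections + 1 >= 3 )); then
    echo "rescope"       # third rejection: the scope itself is suspect
  else
    echo "revise-plan"   # loop back to harness-planner, then re-run gauntlet
  fi
}

gauntlet_gate 0 0   # prints: pass
gauntlet_gate 2 0   # prints: revise-plan
gauntlet_gate 1 2   # prints: rescope
```

Change requests and notes do not appear as inputs because, per the list above, only blockers force a loop back.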

Skill 4 — `harness-coordinator`

When to invoke: At the start of every session where you're running the coordinator role. Phases 2-N.

Load: Skill({ skill: "harness-coordinator" })

Critical rule: The coordinator skill is loaded by the COORDINATOR session — the orchestrator. A coordinator NEVER writes code. It dispatches specialists via the Task tool. This separation is structural.

What it does:

Reads progress.json at session start, finds the next pending feature
Reads claude-progress.txt for cross-session context
Spawns a per-agent git worktree for the specialist
Dispatches the specialist via Task tool with a precise brief
Monitors the specialist's progress (via tool result)
Merges the specialist's PR back via git merge --no-ff
Updates progress.json: feature status → completed, records the git commit SHA
Appends 2 lines to claude-progress.txt: what shipped, who shipped it
Loops to the next pending feature
At end of session: writes a session-end summary

Why it's load-bearing: Without the coordinator role being structurally distinct from the implementor role, the coordinator session will inevitably try to "just fix this one thing" or "just verify this output." The context window blows up. The coordinator loses track of the big picture. Features get partially shipped. The progress.json gets stale.

The separation IS the discipline. The coordinator can hold the project-level state because it's not also holding implementation-level state. The specialist can produce high-quality work because it has a narrow scope. This is the same architectural principle as microservices — bounded contexts, explicit interfaces, no shared state.

The "coordinator never implements" rule has been violated often enough in PM33 history that we know it's a real failure mode. Every violation traced to a slippery slope: "this is just a one-line config tweak" → "while I'm here let me also fix the test" → "actually let me write the whole feature, it's only 30 minutes" → 3 hours later the coordinator session has shipped buggy work and the progress.json is wrong.

Skill 5 — `harness-discipline`

When to invoke: By each specialist agent at the start of its execution. Loaded inside Task tool dispatches.

Load: Skill({ skill: "harness-discipline" })

Critical rule: This skill is for SPECIALISTS, not coordinators. Loading it in a coordinator session would conflate the roles.

What it does: Enforces the TDD execution discipline per feature:

RED phase — the specialist MUST write the failing test FIRST. Confirm it fails for the right reason. No implementation code yet.
GREEN phase — minimum implementation to pass the test. No extra features. No speculative refactors.
REFACTOR phase — clean up, ensure no regression in adjacent tests. Decisions documented (e.g., cache vs DB query — why?).
DELIVERY phase — npm run type-check, npm run test:locked, npm run lint, npm run cleanup:node-processes, git push, gh pr create.
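The DELIVERY phase is a sequence of gates where a failure should stop everything after it. A minimal sketch of that behavior follows; `run_gates` is a hypothetical helper, not something the skill ships, and the npm script names in the usage comment are simply the ones listed above.

```shell
#!/usr/bin/env bash
# Run each gate command in order and stop at the first failure.
# Hypothetical helper: gate commands are passed in as strings.
run_gates() {
  local gate
  for gate in "$@"; do
    echo "gate: ${gate}"
    if ! bash -c "${gate}"; then
      echo "gate failed: ${gate}" >&2
      return 1
    fi
  done
  echo "all gates passed"
}

# With the DELIVERY commands from the list above, a call might look like:
#   run_gates "npm run type-check" "npm run test:locked" "npm run lint" \
#             "npm run cleanup:node-processes" \
#     && git push && gh pr create
```

Chaining `git push` and `gh pr create` behind `&&` means nothing is pushed unless every gate passed.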

The skill also enforces:

Per-agent GIT_INDEX_FILE activation before any git operation
Per-agent worktree usage when running parallel with other specialists
Explicit-paths-only for git add (no git add -A)
progress.json updates on completion
Reporting back to the coordinator with evidence (PR URL, commit SHA, test count)

Why it's load-bearing: Without TDD discipline, specialists rationalize tests to pass after writing the implementation. Mocks expand to hide real bugs. The OUTCOMES-001 class of bug — code that passed all mocked tests but failed in production on real data — comes from skipping RED.

Likewise, without per-agent git protections you get ABSORPTION-002: cross-agent commit absorption. The discipline skill activates the per-agent index BEFORE the specialist's first git add, which is the only point at which absorption can be prevented.

How the skills compose

prep (3 sub-skills)                →  docs/dogfood/discovery/<slug>.md
docs/dogfood/discovery/<slug>.md   →  feeds harness-planner (Phases 1-3)
plan draft                         →  feeds gauntlet-review (Phase 4)
gauntlet findings                  →  integrated into plan Appendix A
plan + Appendix A                  →  unlocks coordinator
coordinator dispatches             →  specialists load discipline
specialists ship                   →  coordinator updates progress.json

The composition is one-way. Each downstream skill assumes the upstream skill has run. The coordinator doesn't validate that gauntlet ran (it can't, because gauntlet output lives in decisions/ files, not in enforced state). In practice, if you skip gauntlet, the harness will hit a blocker mid-execution, and the coordinator will surface it as a failed Task. The system fails loudly, not silently.

In addition to the 5 harness skills above, PM33 ships a pm33-mcp skill that captures conventions for working with PM33's MCP tools (mcp__pm33-staging__*). This is a Claude Code skill, not a harness phase — it's invoked whenever an agent is about to call PM33 MCP tools.

Load: Skill({ skill: "pm33-mcp" }) (invoke BEFORE the first pm33_* tool call in a session)

What it covers:

Tool routing — which MCP tool to use for which goal (work item creation, Brief creation, querying backlog, linking objectives, running prioritization)
Known gaps — the 9 documented gaps in PM33-CREATE-PATH-GAPS-001 (e.g., description field capped at 5000 chars, score_alignment returns keyword similarity rather than objective binding, MCP intermittent disconnect patterns) with workarounds for each
Batch parallel patterns — when to fire 4-8 pm33_create_work_item calls in a single message vs. sequence them
Canonical parent UUIDs — known epic UUIDs for tech-debt audit, harness work, Pam infrastructure, etc.
MCP instability handling — the queue-and-execute pattern for when MCP drops mid-call (queue pending calls in a memory file, continue MCP-independent work, drain queue on reconnect)
Field-encoding conventions — markdown-to-PM33 mappings (P0/P1/P2/P3 → critical/high/medium/low; "30 min"/"few hours"/"1 day" → 1/2/5 story points)
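The field-encoding conventions are plain lookups. As a sketch, assuming you want them as shell helpers (the function names are hypothetical; only the mappings come from the list above):

```shell
#!/usr/bin/env bash
# Mappings taken from the text; function names are hypothetical.
priority_to_pm33() {
  case "$1" in
    P0) echo "critical" ;;
    P1) echo "high" ;;
    P2) echo "medium" ;;
    P3) echo "low" ;;
    *)  echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

effort_to_points() {
  case "$1" in
    "30 min")    echo 1 ;;
    "few hours") echo 2 ;;
    "1 day")     echo 5 ;;
    *)           echo "unknown effort: $1" >&2; return 1 ;;
  esac
}

priority_to_pm33 P1            # prints: high
effort_to_points "few hours"   # prints: 2
```

Unknown inputs fail with a non-zero status rather than guessing, so a malformed markdown label is caught before any work item is filed.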

Why it's a separate skill: PM33 MCP tools have specific gotchas that don't apply to other MCP servers. Encoding these in a skill (rather than CLAUDE.md) means the conventions are loaded only when relevant, not in every session.

The skill is itself an example of the Anthropic guidance — "Load specialized expertise on-demand without bloating every session" (see module 1's Harness Ecosystem section). The pm33-mcp skill is ~5,000 chars; loading it in every session would waste context. Loading it only when about to file work items or query backlog is the right tradeoff.

If you're building your own AI product development platform with MCP tools, expect to ship a similar conventions skill specific to your MCP server.

Common questions

Q: Can a small project skip discovery? A: If the project is < 16 hours, it shouldn't be a harness at all (use a single Brief). For harness-tier work, ALWAYS run discovery. The roughly 30 minutes of overhead is small next to the 5+ hours of mid-stream rework that skipping discovery causes.

Q: When does gauntlet need to run again mid-execution? A: If the plan materially changes (new phases added, scope expanded, dependencies revised). For a small in-scope adjustment, the coordinator can decide it doesn't warrant gauntlet. For "we discovered we also need to migrate X" — yes, re-gauntlet.

Q: Can one session play coordinator AND implementor? A: No. The structural separation is the discipline. If you find yourself wanting to "just implement this one thing" from the coordinator session, that's the signal that you've lost the coordinator role. Dispatch a specialist.

Q: Do specialists communicate with each other? A: Never directly. The coordinator is the only one holding cross-specialist context. If two specialists' work needs to compose, the coordinator schedules them sequentially, or the planner architects the work so they don't depend on each other.

Q: What if gauntlet rejects the plan repeatedly? A: That's the system working correctly. Loop back to planner. If gauntlet rejects 3+ times, the underlying scope is wrong — probably needs to be split into multiple smaller harnesses or the team needs to make architectural decisions first.

Workflow narrative — Spinning up "PAM-CONTEXT-BUDGET-001"

Meet a real harness: PAM-CONTEXT-BUDGET-001. The Pam orchestrator's system prompt was overshooting its context budget — symptoms were intermittent silent truncation, occasional 500s on long sessions. A team estimated 18-24 hours of work to investigate, design a fix, and ship across 4-5 phases.

Here's how the harness gets set up.

Step 1 — Pick a HARNESS-ID

The convention is {DOMAIN}-{PROBLEM}-{SEQUENCE}. For this one: PAM-CONTEXT-BUDGET-001. Short, greppable, won't collide with future harnesses.
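A quick format check can catch malformed IDs before they become directory names. The pattern below is an assumption inferred from the single example: uppercase alphanumeric segments and a three-digit sequence. The function name `is_valid_harness_id` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: {DOMAIN}-{PROBLEM}-{SEQUENCE} means uppercase alphanumeric
# segments joined by hyphens, ending in a three-digit sequence number.
is_valid_harness_id() {
  local pattern='^[A-Z][A-Z0-9]*(-[A-Z0-9]+)+-[0-9]{3}$'
  [[ "$1" =~ $pattern ]]
}

is_valid_harness_id "PAM-CONTEXT-BUDGET-001" && echo "ok"
is_valid_harness_id "pam context budget" || echo "rejected"
```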

Step 2 — Create the scaffold

HARNESS_ID="PAM-CONTEXT-BUDGET-001"
mkdir -p "docs/frameworks/agent-state-${HARNESS_ID}"
cd "docs/frameworks/agent-state-${HARNESS_ID}"

Step 3 — Write the README

The README is the harness's constitution. It includes:

Problem statement (what's broken, why now)
Phase breakdown (3-7 phases, each ~3-6 hours)
Success criteria (objective, measurable)
Risks + mitigations
Out-of-scope (what you'll explicitly NOT do)

Resist the temptation to write a 30-page design doc. The README is a navigational tool, not a spec. The spec lives in individual Briefs.

Step 4 — Bootstrap `progress.json`

This is the canonical state. Example structure:

{
  "harness_id": "PAM-CONTEXT-BUDGET-001",
  "created_at": "2026-05-27T09:00:00Z",
  "phases": [
    {
      "id": 1,
      "name": "Instrument context-budget measurement",
      "status": "in_progress",
      "features": [
        { "id": "F1.1", "name": "Add tokenCounter middleware to PamOrchestrator", "status": "completed", "git_commit": "abc1234" },
        { "id": "F1.2", "name": "Emit budget_exceeded structured log event", "status": "in_progress" }
      ]
    },
    {
      "id": 2,
      "name": "Identify top 5 budget consumers",
      "status": "pending",
      "features": [...]
    }
  ]
}

Every feature has a status (pending | in_progress | completed | blocked) and a git commit SHA when completed. The coordinator queries this file at every session start.
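Reading this state usually comes down to a couple of jq queries. The sketch below assumes jq is installed and uses a trimmed sample file; feature F2.1 is made up for illustration, and the function names are hypothetical rather than part of the coordinator skill.

```shell
#!/usr/bin/env bash
# Requires jq. The sample file is illustrative; F2.1 is a made-up feature.
progress_file="$(mktemp)"
cat > "${progress_file}" <<'JSON'
{
  "harness_id": "PAM-CONTEXT-BUDGET-001",
  "phases": [
    { "id": 1, "status": "in_progress", "features": [
        { "id": "F1.1", "status": "completed", "git_commit": "abc1234" },
        { "id": "F1.2", "status": "in_progress" } ] },
    { "id": 2, "status": "pending", "features": [
        { "id": "F2.1", "status": "pending" } ] }
  ]
}
JSON

# First feature, in file order, whose status is "pending".
next_pending_feature() {
  jq -r 'first(.phases[].features[] | select(.status == "pending") | .id)' "$1"
}

# Tally of features per status, one "status count" pair per line.
status_counts() {
  jq -r '[.phases[].features[].status] | group_by(.) | map("\(.[0]) \(length)") | .[]' "$1"
}

next_pending_feature "${progress_file}"   # prints: F2.1
status_counts "${progress_file}"
```

When no feature is pending, `next_pending_feature` prints nothing, which a caller can treat as "harness complete or fully in flight".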

Step 5 — Write `init.sh`

A bash script that runs at every session start, validates the environment, and prints the current status. Example:

#!/usr/bin/env bash
set -euo pipefail

HARNESS_ID="PAM-CONTEXT-BUDGET-001"
cd "$(git rev-parse --show-toplevel)"

# Validate env (fail fast; errors go to stderr)
command -v jq >/dev/null 2>&1 || { echo "jq is required" >&2; exit 1; }
[ -d "docs/frameworks/agent-state-${HARNESS_ID}" ] || { echo "Harness dir missing" >&2; exit 1; }
[ -f "docs/frameworks/agent-state-${HARNESS_ID}/pm33-agent-progress.json" ] || { echo "progress.json missing" >&2; exit 1; }

# Print current state
jq '.phases[] | select(.status != "completed")' \
  "docs/frameworks/agent-state-${HARNESS_ID}/pm33-agent-progress.json"

# DB health (if harness touches DB)
npm run db:validate --silent

echo "Harness ${HARNESS_ID} ready."

Step 6 — Activate per-agent git index for the coordinator session

cd "$(git rev-parse --show-toplevel)"   # the script path below is relative to the repo root
export CLAUDE_AGENT_ID="${HARNESS_ID}-coord"
eval "$(./scripts/git/agent-init.sh)"

This sets GIT_INDEX_FILE to a unique per-agent path. Now git add in this shell can't absorb other sessions' files. The bin/git wrapper additionally refuses stash, dirty checkout, reset --hard, clean -f — the dangerous operations that caused historical incidents.
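To see why a private index prevents absorption, it helps to watch Git's `GIT_INDEX_FILE` variable in a throwaway repository. This is a sketch of the underlying mechanism only: the agent names and index paths are hypothetical, and `agent-init.sh` may set things up differently.

```shell
#!/usr/bin/env bash
# Demonstrates Git's GIT_INDEX_FILE variable in a throwaway repo.
# Agent names and index paths are hypothetical; agent-init.sh may differ.
repo="$(mktemp -d)"
cd "${repo}" || exit 1
git init -q
echo "base" > base.txt
git add -- base.txt
git -c user.name="demo" -c user.email="demo@example.com" \
    -c commit.gpgsign=false commit -q -m "base"

echo "a" > a.txt   # work belonging to agent-a
echo "b" > b.txt   # work belonging to agent-b

# Stage a path into one agent's private index, seeding it from HEAD on first use.
stage_as() {
  local index="${repo}/.git/index-$1"
  [ -f "${index}" ] || GIT_INDEX_FILE="${index}" git read-tree HEAD
  GIT_INDEX_FILE="${index}" git add -- "$2"
}

# List what one agent has staged relative to HEAD.
staged_by() {
  GIT_INDEX_FILE="${repo}/.git/index-$1" git diff --cached --name-only
}

stage_as agent-a a.txt
stage_as agent-b b.txt
staged_by agent-a   # prints: a.txt
staged_by agent-b   # prints: b.txt
```

Each agent sees only its own staged file, even though both files sit in the same working tree. A commit made with one agent's index would therefore contain only that agent's paths.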

Step 7 — Start dispatching specialists

The coordinator picks the next pending feature, spawns a worktree for the specialist, dispatches via Task tool:

SPECIALIST_ID="${HARNESS_ID}-F1.2-haiku"
bash scripts/git/spawn-agent-worktree.sh "${SPECIALIST_ID}"
# returns: /path/to/.claude/worktrees/agent-${SPECIALIST_ID}

Then the Task tool dispatch (in the coordinator's Claude Code session):

Task({
  subagent_type: "backend-architect",
  model: "haiku",
  description: "F1.2: Emit budget_exceeded structured log",
  prompt: `
    **Working directory**: /path/to/.claude/worktrees/agent-${SPECIALIST_ID}
    **Branch**: worktree-agent-${SPECIALIST_ID}
    **Agent ID**: ${SPECIALIST_ID}

    Run before any git or file op:
      cd <working directory>
      export CLAUDE_AGENT_ID=${SPECIALIST_ID}
      eval "$(./scripts/git/agent-init.sh)"

    Load Skill({ skill: "harness-discipline" }).

    [feature spec + AC + TDD plan]

    When done: commit + push, return PR URL + commit SHA + test count.
  `
})

The specialist runs the TDD cycle, commits to its own worktree branch, pushes, opens a PR. The coordinator monitors via tool result, then merges via gh pr merge --squash.

Step 8 — Repeat for each feature

The coordinator's loop:

Read progress.json, find next pending feature
Spawn worktree, dispatch specialist
Wait for specialist to complete
Verify the PR + git log
Update progress.json: feature status → completed, record commit SHA
Append a 2-line entry to claude-progress.txt: what shipped, who shipped it
Loop
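The two bookkeeping steps in that loop (update progress.json, append to claude-progress.txt) can be sketched with jq. Everything here is illustrative: the function name, the sample data, and the wording of the log lines are assumptions, and jq must be installed.

```shell
#!/usr/bin/env bash
# Requires jq. Function name, sample data, and log wording are hypothetical.
# Marks one feature completed, records its commit SHA, and appends two log lines.
mark_feature_completed() {
  local progress_file="$1" log_file="$2" feature_id="$3" sha="$4" agent="$5"
  local tmp
  tmp="$(mktemp)"
  # Write to a temp file first so a jq failure never truncates the state file.
  if ! jq --arg id "${feature_id}" --arg sha "${sha}" \
      '(.phases[].features[] | select(.id == $id)) |= (.status = "completed" | .git_commit = $sha)' \
      "${progress_file}" > "${tmp}"; then
    rm -f "${tmp}"
    return 1
  fi
  mv "${tmp}" "${progress_file}"
  {
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) shipped ${feature_id} (${sha})"
    echo "  by ${agent}"
  } >> "${log_file}"
}

workdir="$(mktemp -d)"
cat > "${workdir}/progress.json" <<'JSON'
{ "phases": [ { "id": 1, "features": [ { "id": "F1.2", "status": "in_progress" } ] } ] }
JSON
mark_feature_completed "${workdir}/progress.json" "${workdir}/claude-progress.txt" \
  "F1.2" "def5678" "PAM-CONTEXT-BUDGET-001-F1.2-haiku"
jq -r '.phases[0].features[0].status' "${workdir}/progress.json"   # prints: completed
```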

Step 9 — End of harness

When all features in progress.json are completed:

Run delivery validation
Open the final wrap-up PR if there's a synthesis step
Update TECHNICAL_DEBT.md if any new tech debt was filed during the harness
Append the final session entry to claude-progress.txt
File a PM33 work item with status done for the harness itself

The full set of guard rails

The harness wouldn't work without these structural protections, learned from incident history:

Protection	What it prevents	Where it's enforced
Per-agent git index (`GIT_INDEX_FILE`)	Cross-agent commit absorption	`scripts/git/agent-init.sh` + `bin/git` wrapper
Per-agent worktree	Working tree contamination	`scripts/git/spawn-agent-worktree.sh`
`bin/git` wrapper refuses `stash`/`reset --hard`/`clean -f`	Lost work	`bin/git` in PATH
Pre-commit hook: tree-shrink guard	Empty-index commits that nuke files	`.husky/pre-commit`
Pre-commit hook: deletion delta check	Mass deletion accidents	`.husky/pre-commit`
`harness-discipline` skill: TDD enforcement	Skipping RED phase	Skill checklist
`progress.json` as source of truth	"Who's working on what" confusion	Coordinator reads at every step

Each one of those exists because of a real incident. The harness documentation includes pointers to TECHNICAL_DEBT.md entries describing each. New harnesses inherit all these protections — you don't have to set them up.
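
To make the wrapper row of the table concrete, here is a minimal sketch of how a git wrapper could refuse the three destructive operations. It is not the actual bin/git script: the function name guarded_git and the refusal messages are assumptions, and the sketch only inspects the subcommand when it is the first argument.

```shell
# Hypothetical stand-in for a git wrapper: refuse destructive subcommands,
# pass everything else through to the real git.
guarded_git() {
  local arg
  case "${1:-}" in
    stash)
      echo "refused: git stash is banned here" >&2
      return 1
      ;;
    reset)
      for arg in "$@"; do
        if [ "$arg" = "--hard" ]; then
          echo "refused: git reset --hard is banned here" >&2
          return 1
        fi
      done
      ;;
    clean)
      for arg in "$@"; do
        case "$arg" in
          # Matches --force, -f, -fd, and clustered forms such as -df.
          --force|-f*|-[!-]*f*)
            echo "refused: git clean -f is banned here" >&2
            return 1
            ;;
        esac
      done
      ;;
  esac
  command git "$@"
}
```

A production wrapper would also need to handle global options that precede the subcommand (for example git -C some/dir stash), which this sketch does not.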

Common failure modes + recovery

"Absorption — my commit picked up another session's files"

Symptom: git show --stat HEAD shows files you didn't intend to stage. This is common when you forget to activate the per-agent index in a long-running coordinator shell.

Recovery:

git reset --soft HEAD~1                            # undo the commit, keep changes
git restore --staged path/to/absorbed/file         # unstage the foreign file
git commit -m "fix: <original message>"            # recommit with only your files

Then activate per-agent index before continuing:

export CLAUDE_AGENT_ID="<your-id>"
eval "$(./scripts/git/agent-init.sh)"

"Dirty worktree blocks rebase / merge"

Symptom: git pull --rebase fails with "you have unstaged changes." Usually means another concurrent session has uncommitted work in this worktree.

Recovery: do NOT stash (it is banned for good reason: pop conflicts lose work). Instead, pick one of these:

Wait for the other session to commit
Create a fresh worktree off origin/main and cherry-pick your commit there
Open a PR from your current branch, even though it's mixed — PRs sort out the conflict at merge time
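
The second option (fresh worktree plus cherry-pick) can be sketched with stock git commands. The function name, worktree path, and branch name are placeholders, and the sketch assumes your work is already committed as the latest commit on the current branch. In a PM33 repository you would normally create the worktree through scripts/git/spawn-agent-worktree.sh rather than by hand.

```shell
# Hypothetical helper: replay the latest commit onto a clean worktree that
# starts from origin/main.
# Arguments: <new worktree path> <new branch name>
rescue_commit_to_fresh_worktree() {
  local path="$1" branch="$2" sha
  sha="$(git rev-parse HEAD)" || return 1
  git fetch origin main || return 1
  git worktree add -b "$branch" "$path" origin/main || return 1
  # Assumption: an index file exported for the old worktree must not be
  # reused in the new one, so it is unset for the cherry-pick only.
  ( unset GIT_INDEX_FILE; git -C "$path" cherry-pick "$sha" )
}

# Usage (placeholder path and branch):
#   rescue_commit_to_fresh_worktree ../rescue-worktree rescue/my-feature
```

The dirty worktree is left untouched, so the other session's uncommitted work stays exactly where it was.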

"Specialist finished but PR shows no files"

Symptom: the PR opens, but gh pr view shows fewer files than expected (in the worst case, none). The rebase agent or the per-agent index handling dropped files silently.

Recovery: this is the PR #55 + #56 hotfix pattern. Verify files via gh pr view --json files | jq '.files | length'. If files are missing:

Find the files in the worktree (they're usually still on disk)
Spawn a fresh worktree off the latest main
Copy the missing files in
Commit + push + open a hotfix PR

The general lesson: always audit git show --stat HEAD and gh pr view --json files after commits and before merges. The Excalidraw/PR sanity check is cheap; absorption recovery is expensive.
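
One way to turn that audit into a habit is a small check that fails loudly when a PR carries fewer files than you expect. The sketch below reads the JSON that gh pr view --json files prints; the function name and the PR number in the usage line are placeholders.

```shell
# Hypothetical audit helper. Reads `gh pr view --json files` output on stdin
# and fails if the PR contains fewer files than the expected minimum.
pr_has_at_least() {
  local expected="$1" actual
  actual="$(jq '.files | length')" || return 1
  if [ "$actual" -lt "$expected" ]; then
    echo "PR has $actual files, expected at least $expected" >&2
    return 1
  fi
}

# Usage (123 is a placeholder PR number):
#   gh pr view 123 --json files | pr_has_at_least 4
```

Taking the JSON on stdin keeps the check independent of how the PR is looked up, so the same function works on saved output.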

"Coordinator session got confused about which feature to dispatch next"

Symptom: you (or the coordinator) re-dispatch a feature that's already in progress, or skip one that's pending.

Recovery: progress.json is the source of truth. Always re-read it before deciding. If two coordinators are running concurrently (rare but possible), the coordinator mutex (scripts/git/coordinator-mutex.sh) prevents the worst races. Acquire at session start, release at end.
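
For readers curious how such a mutex can work, here is a minimal sketch of the acquire/release pattern using an atomic mkdir. It is an illustration only: the internals of scripts/git/coordinator-mutex.sh are not shown in this module, and the LOCK_DIR location and function names below are assumptions.

```shell
# Hypothetical mkdir-based lock. LOCK_DIR is an assumed location.
LOCK_DIR="${LOCK_DIR:-.git/coordinator.lock}"

acquire_coordinator_lock() {
  # mkdir is atomic: it fails if the directory already exists, so only one
  # coordinator can win the race.
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "$$" > "$LOCK_DIR/pid"   # record the holder for debugging
    return 0
  fi
  echo "another coordinator holds $LOCK_DIR" >&2
  return 1
}

release_coordinator_lock() {
  rm -rf "$LOCK_DIR"
}

# Usage: acquire at session start, release at end, even on error.
#   acquire_coordinator_lock || exit 1
#   trap release_coordinator_lock EXIT
```

A lock like this does not detect a holder that crashed without releasing; the recorded pid is only a hint for cleaning up by hand.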

Hands-on (15 minutes)

Spin up a tiny test harness:

HARNESS_ID="DEMO-CURRICULUM-001"
mkdir -p docs/frameworks/agent-state-${HARNESS_ID}
cd docs/frameworks/agent-state-${HARNESS_ID}

# Minimal README
cat > README.md << 'EOF'
# Demo Harness — Curriculum Module 3 Walkthrough

## Problem
Curriculum reader wants to feel the shape of a harness.

## Phases
1. Walk through the structure
2. Read progress.json
3. (Pretend to) dispatch a specialist

## Success criteria
- Reader can find the next pending feature
- Reader understands the coordinator/specialist split

## Out of scope
- Actually shipping any code
EOF

# Minimal progress.json
cat > pm33-agent-progress.json << 'EOF'
{
  "harness_id": "DEMO-CURRICULUM-001",
  "created_at": "2026-05-27T00:00:00Z",
  "phases": [
    {
      "id": 1, "name": "Walkthrough",
      "status": "in_progress",
      "features": [
        { "id": "F1.1", "name": "Read this README", "status": "completed" },
        { "id": "F1.2", "name": "Read progress.json", "status": "in_progress" },
        { "id": "F1.3", "name": "Read the coordinator pattern in Module 3", "status": "pending" }
      ]
    }
  ]
}
EOF

# Read what the coordinator would see
jq '.phases[] | select(.status != "completed") | .features[] | select(.status == "pending")' pm33-agent-progress.json

# Cleanup when done (the :? guard aborts if HARNESS_ID is unset, e.g. in a new shell)
cd ../../..
rm -rf "docs/frameworks/agent-state-${HARNESS_ID:?HARNESS_ID is not set}"

If the jq query returns the F1.3 feature, you've successfully read what the coordinator reads at every step.

Try the harness on your machine