Outline
- Learning objectives — difference between done and outcome_attributed; outcomeHook anatomy; AR(1) basics
- Key concept — the closed loop is the differentiator; outcome attribution back to strategy
- Diagram walkthrough — vertical funnel: Objective → Epics → Briefs → Shipped → Metric → AR(1) Recalibration → loops back
- The outcomeHook field — how a Brief declares its success criterion (metric, source, window, threshold, predicted delta)
- AR(1) forecast model in plain English — Bayesian shrinkage, σ tightens with accurate predictions
- Workflow narrative — Sarah's TTFCV objective recalibrating sprint-by-sprint over 6 months
- Why this is hard for competitors to copy — 3 structural reasons (lifecycle events architectural, tenant-scoped metrics foundational, workspace priors statistical)
- What this enables — 3 buyer-facing capabilities (real attribution QBR, sprint planning that improves, strategic course-correction)
- Further reading — AR1-CALIB-001 framework, workspace prior design, module 5
Learning objectives
After this module you should be able to:
- Explain the difference between done and outcome_attributed
- Read an outcomeHook and predict what metric it will measure
- Explain how AR(1) forecast recalibration uses outcome data
- Articulate why this is the PM33 differentiator vs. "AI codegen tools"
Key concept
Most AI development tools stop at "code shipped." PM33 doesn't. The shipping moment opens the outcome_tracking window — a configurable period (default 7-30 days) during which PM33 measures whether the predicted metric movement actually happened. The result feeds back into:
- The AR(1) forecast model (priors get tighter, σ shrinks)
- The capacity model (which kinds of work actually deliver business value)
- The next sprint's prioritization (Briefs in areas with proven impact get weighted higher)
This closed loop is the PM33 differentiator. "AI agents wrote code" is table stakes. "AI-driven strategy → outcome attribution → continuous recalibration" is the product.
Diagram walkthrough
A vertical "outcome funnel" from top to bottom:
- Strategic Objective (top, wide box) — "Reduce TTFCV by 30% by Q4"
- Linked Epics (4 boxes, narrower) — each epic has alignment_score 0-1
- Linked Briefs (12+ boxes, narrowest row) — each Brief has its own outcomeHook
- Shipped Code (bottom, PR badges)
To the right, a feedback loop:
- Metric instrumentation — outcomeHook fires
- Realized vs Forecast — the AR(1) comparison
- Recalibration — model updates μ₀, σ, capacity priors
- → loops back to "Strategic Objective" box, updating its forecast trajectory
Color coding:
- Strategic objective: dark blue
- Epics: medium blue
- Briefs: purple (matching the orchestrator color from slide 1)
- Shipped: green
- Recalibration loop: red (the load-bearing feedback)
The outcomeHook field
Every Brief (optionally) defines an outcomeHook. It tells PM33 how to measure whether the work mattered. Example:
outcomeHook:
  metric: sprint_planning_page_p95_ms
  source: /api/metrics
  window: 7d
  attribution_threshold: 0.5   # only credit this Brief if delta > 50%
  predicted_delta: -0.85       # prediction: 85% reduction
  baseline_window: 7d_pre_deploy
When the PR merges, PM33:
- Records the baseline value of sprint_planning_page_p95_ms from the 7 days before merge
- Opens the 7-day measurement window
- At the end of the window, computes the delta
- If the size of the delta exceeds attribution_threshold (here, a change of more than 50%) → credits this Brief in the strategic objective's attribution log
- Compares actual delta to predicted_delta → feeds into AR(1) recalibration
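The steps above can be sketched in a few lines of Python. This is a minimal illustration, not PM33's implementation: the `OutcomeHook` and `OutcomeResult` classes and the `evaluate_outcome` function are hypothetical names, the delta is assumed to be a signed relative change against the baseline, and credit is assumed to require movement in the predicted direction.

```python
from dataclasses import dataclass


@dataclass
class OutcomeHook:
    # Field names mirror the YAML example; the class itself is hypothetical.
    metric: str
    attribution_threshold: float  # minimum relative change, as a magnitude
    predicted_delta: float        # signed: -0.85 means an 85% reduction


@dataclass
class OutcomeResult:
    delta: float             # signed relative change vs. baseline
    attributed: bool         # True if the Brief earns credit
    prediction_error: float  # actual delta minus predicted delta


def evaluate_outcome(hook: OutcomeHook, baseline: float, observed: float) -> OutcomeResult:
    """Compare the end-of-window value with the pre-merge baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative delta")
    delta = (observed - baseline) / baseline
    # Assumption: credit requires movement in the predicted direction
    # whose magnitude exceeds the threshold.
    same_direction = delta * hook.predicted_delta > 0
    attributed = same_direction and abs(delta) > hook.attribution_threshold
    return OutcomeResult(delta, attributed, delta - hook.predicted_delta)


hook = OutcomeHook("sprint_planning_page_p95_ms", 0.5, -0.85)
# Illustrative values: p95 drops from 2000 ms to 400 ms, an 80% reduction.
result = evaluate_outcome(hook, baseline=2000.0, observed=400.0)
print(result.attributed)  # True
```

The `prediction_error` value is the part that would feed recalibration: a small error means the predicted delta was close to what actually happened.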
The outcomeHook is optional. Briefs without one (most bug fixes, refactors, internal tooling) skip the outcome window and go straight to done. The framework doesn't force attribution where it doesn't make sense.
AR(1) forecast model in plain English
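AR(1) stands for autoregressive, lag 1: a model in which each value is predicted from the one immediately before it. PM33 combines it with Bayesian updating to predict the next sprint's velocity (or in this case, the next Brief's impact) based on: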
- Prior expectations (workspace's historical baseline μ₀)
- The last observation (most recent shipped Brief's impact)
- A correlation parameter φ (how much the last observation tells us about the next)
- Uncertainty σ (how confident we are)
The model shrinks σ when predictions are accurate and widens σ when predictions miss. After 10 Briefs in a strategic area, you can confidently predict the next Brief's impact ± σ. After 30 Briefs, σ shrinks further. After 100 Briefs in a workspace, you have a workspace-specific prior μ₀ that's much better than the industry default.
This is what powers "PM33 told me this Brief would move TTFCV by 12% ± 3% and it actually moved it by 14%" — the platform learning the workspace's actual dynamics over time.
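To make the four ingredients concrete, here is a toy forecaster in Python. It is a sketch under stated assumptions, not PM33's recalibrator: the `AR1Forecaster` class, the `prior_weight` pseudo-count, the `smoothing` factor, and the default φ of 0.5 are all hypothetical. The mean is shrunk toward μ₀ with a simple weighted average, and σ is tracked as an exponentially weighted root-mean-square of prediction errors, so it shrinks after accurate predictions and widens after misses.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class AR1Forecaster:
    mu0: float                   # prior mean (workspace baseline)
    sigma: float                 # current predictive uncertainty
    phi: float = 0.5             # lag-1 correlation (hypothetical default)
    prior_weight: float = 5.0    # pseudo-observations behind mu0 (hypothetical)
    n: int = 0                   # number of observations seen so far
    total: float = 0.0           # running sum of observations
    last: Optional[float] = None # most recent observation

    @property
    def mu(self) -> float:
        # Shrinkage: blend the prior mean with the observed mean.
        # With few observations the prior dominates; with many, the data does.
        return (self.prior_weight * self.mu0 + self.total) / (self.prior_weight + self.n)

    def predict(self) -> float:
        # AR(1) step: pull the forecast toward the last observation by phi.
        if self.last is None:
            return self.mu
        return self.mu + self.phi * (self.last - self.mu)

    def update(self, observed: float, smoothing: float = 0.3) -> None:
        error = observed - self.predict()
        # Small errors pull sigma down; large errors push it up.
        self.sigma = math.sqrt((1 - smoothing) * self.sigma**2 + smoothing * error**2)
        self.total += observed
        self.n += 1
        self.last = observed


model = AR1Forecaster(mu0=6.0, sigma=2.55)
model.update(6.5)   # close to the forecast, so sigma tightens
model.update(0.0)   # a bad miss, so sigma widens again
```

A production model would estimate φ from data and use a proper posterior for μ and σ; the point here is only the direction of each adjustment.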
Workflow narrative
Sarah's TTFCV objective from Module 1. Six months in, here's what the recalibration loop has produced:
Sprint 1-5 (months 1-2): AR(1) starts with the default workspace prior (μ₀ = 6 SP/Brief impact, σ = 2.55). Predictions are wide.
Sprint 5-15 (months 2-4): After 11 Briefs shipped to the TTFCV area, PM33 has actual data. The recalibrator updates: this workspace's onboarding-area Briefs actually deliver μ = 8.2 SP equivalent impact with σ = 1.4. Tighter, more confident predictions.
Sprint 15-20 (month 5): One Brief misses badly: predicted 12% TTFCV improvement, actual 0%. The AR(1) σ widens slightly. PM33 surfaces this in the strategic objective dashboard: "Recent miss — investigate." Sarah finds the root cause (the feature flag was off in production for the measurement window). Once the flag is fixed, the next observation comes in correctly and σ tightens again.
Sprint 20-25 (month 6): Sarah's quarterly review. The recalibrated forecast shows: "Current trajectory hits 34% TTFCV reduction by end of Q4, ±4% (95% CI). Top 3 contributing Brief areas: onboarding (54% of impact), API-first-call latency (23%), error-message clarity (12%)." This is a real forecast with real uncertainty, not a vibes-based "we're on track."
The recalibration didn't require Sarah to do anything. It happened automatically. The outcome data → model update → priority shift loop ran every sprint.
Why this is hard for competitors to copy
The architecture sounds simple. The execution is hard for three reasons:
- You need lifecycle events for every transition. Without structured events at every state change, you can't compute outcomes. PM33's event bus is in place because every transition (planned → in_progress → in_review → done → outcome_tracked → outcome_attributed) was designed to emit an event. Bolting this on after the fact is a rewrite.
- You need tenant-scoped metrics. Attribution only works if you can isolate "this workspace's metric movement" from noise. PM33's RLS + tenant_id-keyed metrics architecture supports this. Multi-tenant SaaS that didn't design for this from day one has the same rewrite problem.
- You need workspace-specific priors. The default μ₀ doesn't fit every workspace. AR(1) without per-workspace priors gives terrible predictions for atypical workspaces. The AR1-POOLED-PRIOR-001 design in PM33 specifically addresses this with dynamic workspace-median priors and Bayesian shrinkage. This is months of statistical engineering, not a checkbox.
Anyone can ship "AI agents that write code." Few will ship the closed-loop attribution layer.
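The first reason is easier to see in code. The sketch below shows what "every transition emits an event" might look like, assuming a linear lifecycle in the order given above. The `Brief`, `LifecycleEvent`, and `advance` names are hypothetical, and the `emit` callback stands in for whatever event bus is actually used.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List

# Transition order taken from the text above.
LIFECYCLE = [
    "planned", "in_progress", "in_review",
    "done", "outcome_tracked", "outcome_attributed",
]


@dataclass(frozen=True)
class LifecycleEvent:
    tenant_id: str    # keeps events scoped to one workspace
    brief_id: str
    from_state: str
    to_state: str
    at: datetime


@dataclass
class Brief:
    tenant_id: str
    brief_id: str
    state: str = "planned"


def advance(brief: Brief, emit: Callable[[LifecycleEvent], None]) -> None:
    """Move a Brief to its next state and emit exactly one event for it."""
    idx = LIFECYCLE.index(brief.state)
    if idx == len(LIFECYCLE) - 1:
        raise ValueError(f"{brief.brief_id} is already in its final state")
    next_state = LIFECYCLE[idx + 1]
    event = LifecycleEvent(
        brief.tenant_id, brief.brief_id, brief.state, next_state,
        datetime.now(timezone.utc),
    )
    brief.state = next_state
    emit(event)


events: List[LifecycleEvent] = []
brief = Brief(tenant_id="tenant-a", brief_id="brief-1")
advance(brief, events.append)   # planned -> in_progress
advance(brief, events.append)   # in_progress -> in_review
```

Because the only way to change state is through `advance`, no transition can happen without leaving a record, which is the property that is hard to retrofit onto a system where state is written directly.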
What this enables
Three concrete capabilities that buyers care about:
- Quarterly business reviews with real attribution. "We shipped X Briefs against objective Y. Top contributors were A, B, C. Aggregate movement was Z% (predicted W%, actual Z%). Here's the audit trail."
- Sprint planning that improves. The scheduler weighs Briefs in proven-impact areas higher. The prediction error margin shrinks over time. The team builds confidence that "if PM33 says it'll work, it usually does."
- Strategic course-correction. If an objective isn't getting closer to its target despite shipping work against it, PM33 surfaces this. When the recalibrator's "predicted trajectory" diverges from the "target trajectory", Pam alerts the owner. Caught early, not at quarter-end.
Further reading
- docs/frameworks/agent-state-AR1-CALIB-001/ — the AR(1) calibration framework
- .claude/memory/ar1-pooled-prior-001.md — workspace-specific prior design
- Module 5: Governance & Trust — the audit trail that makes attribution defensible