Outcome Attribution & Recalibration — Builder Track (Module 4)

Outline

Learning objectives — difference between done and outcome_attributed; outcomeHook anatomy; AR(1) basics
Key concept — the closed loop is the differentiator; outcome attribution back to strategy
Diagram walkthrough — vertical funnel: Objective → Epics → Briefs → Shipped → Metric → AR(1) Recalibration → loops back
The outcomeHook field — how a Brief declares its success criterion (metric, source, window, threshold, predicted delta)
AR(1) forecast model in plain English — Bayesian shrinkage, σ tightens with accurate predictions
Workflow narrative — Sarah's TTFCV objective recalibrating sprint-by-sprint over 6 months
Why this is hard for competitors to copy — 3 structural reasons (lifecycle events architectural, tenant-scoped metrics foundational, workspace priors statistical)
What this enables — 3 buyer-facing capabilities (real attribution QBR, sprint planning that improves, strategic course-correction)
Further reading — AR1-CALIB-001 framework, workspace prior design, module 5

Learning objectives

After this module you should be able to:

Explain the difference between done and outcome_attributed
Read an outcomeHook and predict what metric it will measure
Explain how AR(1) forecast recalibration uses outcome data
Articulate why this is the PM33 differentiator vs. "AI codegen tools"

Key concept

Most AI development tools stop at "code shipped." PM33 doesn't. The shipping moment opens the outcome_tracking window — a configurable period (default 7-30 days) during which PM33 measures whether the predicted metric movement actually happened. The result feeds back into:

The AR(1) forecast model (priors get tighter, σ shrinks)
The capacity model (which kinds of work actually deliver business value)
The next sprint's prioritization (Briefs in areas with proven impact get weighted higher)

This closed loop is the PM33 differentiator. "AI agents wrote code" is table stakes. "AI-driven strategy → outcome attribution → continuous recalibration" is the product.

Diagram walkthrough

A vertical "outcome funnel" from top to bottom:

Strategic Objective (top, wide box) — "Reduce TTFCV by 30% by Q4"
Linked Epics (4 boxes, narrower) — each epic has alignment_score 0-1
Linked Briefs (12+ boxes, narrowest row) — each Brief has its own outcomeHook
Shipped Code (bottom, PR badges)

To the right, a feedback loop:

Metric instrumentation — outcomeHook fires
Realized vs Forecast — the AR(1) comparison
Recalibration — model updates μ₀, σ, capacity priors
→ loops back to "Strategic Objective" box, updating its forecast trajectory

Color coding:

Strategic objective: dark blue
Epics: medium blue
Briefs: purple (matching the orchestrator color from slide 1)
Shipped: green
Recalibration loop: red (the load-bearing feedback)

The `outcomeHook` field

A Brief can optionally define an outcomeHook, which tells PM33 how to measure whether the work mattered. Example:

outcomeHook:
  metric: sprint_planning_page_p95_ms
  source: /api/metrics
  window: 7d
  attribution_threshold: 0.5  # only credit this Brief if the metric moves by more than 50%
  predicted_delta: -0.85       # prediction: 85% reduction
  baseline_window: 7d_pre_deploy

When the PR merges, PM33:

Records the baseline value of sprint_planning_page_p95_ms from the 7 days before merge
Opens the 7-day measurement window
At the end of the window, computes the delta
If the delta exceeds the threshold in magnitude, in the predicted direction → credits this Brief in the strategic objective's attribution log
Compares actual delta to predicted_delta → feeds into AR(1) recalibration
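
The merge-time steps above can be sketched in a few lines of Python. This is a minimal illustration, not PM33's implementation: the OutcomeHook class, the evaluate_outcome function, and the baseline and observed values (2000 ms and 400 ms) are hypothetical. It also assumes the threshold is compared against the magnitude of the change in the predicted direction, since a literal "delta > 0.5" could never be true for a predicted reduction.

```python
from dataclasses import dataclass


@dataclass
class OutcomeHook:
    """Hypothetical in-memory form of the outcomeHook fields used below."""
    metric: str
    attribution_threshold: float  # minimum relative change (0.5 = 50%)
    predicted_delta: float        # signed relative change (-0.85 = 85% reduction)


def evaluate_outcome(hook: OutcomeHook, baseline: float, observed: float) -> dict:
    """Compare the measurement-window value against the pre-deploy baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative delta")
    delta = (observed - baseline) / baseline
    # Assumption: credit requires movement in the predicted direction
    # whose magnitude exceeds the threshold.
    same_direction = delta * hook.predicted_delta > 0
    attributed = same_direction and abs(delta) > hook.attribution_threshold
    return {
        "metric": hook.metric,
        "delta": delta,
        "attributed": attributed,
        # Signed miss between realized and predicted movement; this is the
        # quantity a recalibration step would consume.
        "prediction_error": delta - hook.predicted_delta,
    }


hook = OutcomeHook("sprint_planning_page_p95_ms", 0.5, -0.85)
result = evaluate_outcome(hook, baseline=2000.0, observed=400.0)
print(result["attributed"], round(result["delta"], 2))
```

In this example the metric falls from 2000 ms to 400 ms, an 80% reduction, so the Brief clears the 50% threshold and is credited even though it fell slightly short of the predicted 85%.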

The outcomeHook is optional. Briefs without one (most bug fixes, refactors, internal tooling) skip the outcome window and go straight to done. The framework doesn't force attribution where it doesn't make sense.

AR(1) forecast model in plain English

AR(1) stands for autoregressive, order 1 (lag-1). PM33 uses it within a Bayesian framework to predict the next sprint's velocity (or in this case, the next Brief's impact) based on:

Prior expectations (workspace's historical baseline μ₀)
The last observation (most recent shipped Brief's impact)
A correlation parameter φ (how much the last observation tells us about the next)
Uncertainty σ (how confident we are)

The model shrinks σ when predictions are accurate and widens σ when predictions miss. After 10 Briefs in a strategic area, you can confidently predict the next Brief's impact ± σ. After 30 Briefs, σ shrinks further. After 100 Briefs in a workspace, you have a workspace-specific prior μ₀ that's much better than the industry default.

This is what powers "PM33 told me this Brief would move TTFCV by 12% ± 3% and it actually moved it by 14%" — the platform learning the workspace's actual dynamics over time.
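
To make the four ingredients concrete, here is a rough sketch under stated assumptions. The AR(1) prediction rule itself is standard: pull the last observation toward μ₀ by a factor φ. Everything else is illustrative: the σ update is a simple exponentially weighted error variance rather than a full Bayesian posterior, the shrinkage rule is a basic weighted average, and the names AR1Forecaster, shrunk_prior, alpha, and prior_strength, as well as φ = 0.5, are hypothetical and not taken from PM33.

```python
from dataclasses import dataclass
from math import sqrt


def shrunk_prior(default_mu0: float, observations: list, prior_strength: float = 5.0) -> float:
    """Blend the default prior with workspace data; more data means less shrinkage."""
    n = len(observations)
    if n == 0:
        return default_mu0
    weight = n / (n + prior_strength)  # approaches 1 as observations accumulate
    workspace_mean = sum(observations) / n
    return weight * workspace_mean + (1 - weight) * default_mu0


@dataclass
class AR1Forecaster:
    mu0: float          # prior mean (workspace baseline)
    phi: float          # lag-1 correlation, expected in (-1, 1)
    sigma: float        # current forecast uncertainty (standard deviation)
    alpha: float = 0.2  # hypothetical weight given to the newest error

    def predict(self, last_observation: float) -> float:
        """AR(1) point forecast: revert from the last observation toward mu0."""
        return self.mu0 + self.phi * (last_observation - self.mu0)

    def update(self, predicted: float, actual: float) -> float:
        """Shrink sigma after a small error, widen it after a large one."""
        error = actual - predicted
        variance = (1 - self.alpha) * self.sigma ** 2 + self.alpha * error ** 2
        self.sigma = sqrt(variance)
        return self.sigma


model = AR1Forecaster(mu0=6.0, phi=0.5, sigma=2.55)
forecast = model.predict(last_observation=8.0)
model.update(predicted=forecast, actual=7.2)
print(forecast, round(model.sigma, 2))
```

With this update rule, σ falls whenever the absolute error is smaller than the current σ and rises whenever it is larger, which matches the "tightens on hits, widens on misses" behavior described above.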

Workflow narrative

Sarah's TTFCV objective from Module 1. Six months in, here's what the recalibration loop has produced:

Sprint 1-5 (months 1-2): AR(1) starts with the default workspace prior (μ₀ = 6 SP/Brief impact, σ = 2.55). Predictions are wide.

Sprint 5-15 (months 2-4): After 11 Briefs shipped to the TTFCV area, PM33 has actual data. The recalibrator updates: this workspace's onboarding-area Briefs actually deliver μ = 8.2 SP equivalent impact with σ = 1.4. Tighter, more confident predictions.

Sprint 15-20 (month 5): One Brief misses badly, with a predicted 12% TTFCV improvement and an actual 0%. The AR(1) σ widens slightly. PM33 surfaces this in the strategic objective dashboard: "Recent miss — investigate." Sarah finds the root cause (the feature flag was off in production for the measurement window). Once the flag is fixed, the next observation comes in correctly and σ tightens again.

Sprint 20-25 (month 6): Sarah's quarterly review. The recalibrated forecast shows: "Current trajectory hits 34% TTFCV reduction by end of Q4, ±4% (95% CI). Top 3 contributing Brief areas: onboarding (54% of impact), API-first-call latency (23%), error-message clarity (12%)." This is a real forecast with real uncertainty, not a vibes-based "we're on track."

The recalibration didn't require Sarah to do anything. It happened automatically. The outcome data → model update → priority shift loop ran every sprint.

Why this is hard for competitors to copy

The architecture sounds simple. The execution is hard for three reasons:

You need lifecycle events for every transition. Without structured events at every state change, you can't compute outcomes. PM33's event bus is in place because every transition (planned → in_progress → in_review → done → outcome_tracked → outcome_attributed) was designed to emit an event. Bolting this on after the fact is a rewrite.
You need tenant-scoped metrics. Attribution only works if you can isolate "this workspace's metric movement" from noise. PM33's RLS + tenant_id-keyed metrics architecture supports this. Multi-tenant SaaS that didn't design for this from day one has the same rewrite problem.
You need workspace-specific priors. The default μ₀ doesn't fit every workspace. AR(1) without per-workspace priors gives terrible predictions for atypical workspaces. The AR1-POOLED-PRIOR-001 design in PM33 specifically addresses this with dynamic workspace-median priors and Bayesian shrinkage. This is months of statistical engineering, not a checkbox.
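
The first two reasons can be illustrated with a small state machine in which every transition emits a tenant-keyed event. This is only a sketch: the state names come from the lifecycle listed above, but the Brief class, the event fields, and the strictly linear transition table are assumptions, not PM33's actual event bus.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

STATES = [
    "planned", "in_progress", "in_review",
    "done", "outcome_tracked", "outcome_attributed",
]
# Assumption: each state may only advance to the next one in the list.
NEXT_STATE = dict(zip(STATES, STATES[1:]))


@dataclass
class Brief:
    brief_id: str
    tenant_id: str
    state: str = "planned"
    events: list = field(default_factory=list)

    def transition(self, new_state: str) -> dict:
        """Advance the lifecycle and record a structured event for the change."""
        if NEXT_STATE.get(self.state) != new_state:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        event = {
            "brief_id": self.brief_id,
            "tenant_id": self.tenant_id,  # keeps downstream metrics tenant-scoped
            "from": self.state,
            "to": new_state,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        self.state = new_state
        self.events.append(event)
        return event


brief = Brief(brief_id="brief-1", tenant_id="tenant-a")
for state in STATES[1:]:
    brief.transition(state)
print(len(brief.events), brief.state)
```

Because the event is created inside the transition itself, there is no code path that changes state without leaving a record. That is the property that is hard to retrofit onto a system where status is just a mutable column.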

Anyone can ship "AI agents that write code." Few will ship the closed-loop attribution layer.

What this enables

Three concrete capabilities that buyers care about:

Quarterly business reviews with real attribution. "We shipped X Briefs against objective Y. Top contributors were A, B, C. Aggregate movement was Z% (predicted W%, actual Z%). Here's the audit trail."
Sprint planning that improves. The scheduler weighs Briefs in proven-impact areas higher. The prediction error margin shrinks over time. The team builds confidence that "if PM33 says it'll work, it usually does."
Strategic course-correction. If an objective isn't getting closer to its target despite shipping work against it, PM33 surfaces this. The recalibrator's "predicted trajectory" diverges from "target trajectory" — Pam alerts the owner. Caught early, not at quarter-end.
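
One simple way to express the course-correction check is to compare the target against the forecast's confidence interval. The sketch below is an assumed rule, not PM33's alerting logic: the function name trajectory_status, the three status labels, and the convention that values are percentage reductions (higher is better) are all hypothetical.

```python
def trajectory_status(target: float, forecast: float, ci_half_width: float) -> str:
    """Classify an objective by where its target sits relative to the forecast interval.

    All arguments are in the same unit, here percentage reduction (higher is better).
    """
    low = forecast - ci_half_width
    high = forecast + ci_half_width
    if low >= target:
        return "on_track"   # even the pessimistic end meets the target
    if high < target:
        return "off_track"  # even the optimistic end falls short
    return "at_risk"        # the target lies inside the interval


# Sarah's review: 30% target, forecast of 34% with a ±4% interval.
print(trajectory_status(target=30.0, forecast=34.0, ci_half_width=4.0))
# A hypothetical objective whose forecast has drifted down to 22%.
print(trajectory_status(target=30.0, forecast=22.0, ci_half_width=4.0))
```

Running a check like this every sprint is what lets a divergence be flagged as soon as the interval drops below the target, rather than being discovered at quarter-end.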