The pipeline's contract: no dossier, no merge. A green CI run is necessary but not sufficient. Tests can pass while the actual user-facing behaviour is broken (wrong selectors, mocked-out integration, console errors swallowed). The dossier is what proves the feature actually rendered, in a real browser, in a state matching the spec.
This page defines the dossier format and what each element proves.
The docs/dossiers/<feature>/ directory
In the target state, every feature branch ships with docs/dossiers/<feature>/ populated by the Verifier agent before the PR can merge. Today these folders are hand-crafted; the bundler that automates the layout below is still a gap-list item:
docs/dossiers/<feature>/
├── trace.zip ← Playwright trace (network, DOM snapshots, screencast, console)
├── screenshot.png ← Final rendered state for the primary acceptance scenario
├── console.log ← Browser console output (filtered to warnings + errors)
├── summary.json ← Machine-readable: which scenarios passed, timings, costs
└── README.md ← One-paragraph human summary linking to the spec scenarios
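Until the bundler exists, a completeness check over this layout could be as small as the sketch below. The helper name `missingArtefacts` and the choice to pass in a directory listing (rather than read the filesystem) are illustrative assumptions, not an existing pipeline API.

```typescript
// Required artefacts, taken from the directory layout above.
const REQUIRED_ARTEFACTS = [
  "trace.zip",
  "screenshot.png",
  "console.log",
  "summary.json",
  "README.md",
] as const;

// Hypothetical helper: given the file names found in docs/dossiers/<feature>/,
// return the required artefacts that are absent (empty array = complete).
function missingArtefacts(present: readonly string[]): string[] {
  const found = new Set<string>(present);
  return REQUIRED_ARTEFACTS.filter((name) => !found.has(name));
}
```

Called with `["trace.zip", "summary.json"]`, this returns the three names that still need to be produced, in layout order.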
What each artefact proves
trace.zip
Playwright's native trace bundle. Captures:
- Every network request (URL, method, status, response body).
- DOM snapshot at each interaction.
- Console messages in real time.
- Optionally, a screencast filmstrip of the run (full video is a separate Playwright recording option, not part of the trace).
This is the artefact a human inspects when they don't trust the screenshot. It can be replayed in Playwright Trace Viewer to step through the test as it happened. Mostly an audit tool — agents don't read it; humans do, when something looks off.
screenshot.png
The single image attached to the human-channel notification when the PR is ready. Captured at the end of the primary acceptance scenario named in the spec. Resolution: 1440×900 by default, dark theme, signed-in test user.
This is the proof the human signs off on. The standing rule: a Playwright run that doesn't produce a screenshot showing the implemented feature actually rendering is not verification — it's a no-error smoke test (memory rule feedback_playwright_screenshot_proof.md).
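The screenshot defaults map onto standard Playwright options. The fragment below is a sketch of how a config might pin them; the storage-state path is a hypothetical placeholder, and this page does not say whether the project sets these globally or per test.

```typescript
// Sketch of the `use` block a Playwright config could carry for dossier runs.
// viewport, colorScheme, storageState, trace and screenshot are standard
// Playwright options; the values mirror the defaults described above.
const dossierUse = {
  viewport: { width: 1440, height: 900 },
  colorScheme: "dark" as const,
  // Hypothetical path to a saved signed-in session for the test user.
  storageState: "playwright/.auth/test-user.json",
  trace: "on" as const, // always record a trace
  screenshot: "on" as const, // capture a screenshot when each test finishes
};
```

In a real `playwright.config.ts` this object would typically be passed as the `use` field of `defineConfig`. Copying the resulting files into the dossier folder under the names above would still be the bundler's job.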
console.log
Filtered to warning and error levels. Catches the class of regression where a component "renders" but emits a stack trace into the console — common when a generated GraphQL hook silently fails or a hook ordering bug doesn't crash but does corrupt state.
Empty console.log is a positive signal. Non-empty is a P1 finding for the reviewer.
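A minimal sketch of the filter and the resulting verdict, assuming console messages have already been collected as plain objects. In Playwright the two fields would come from `ConsoleMessage.type()` and `ConsoleMessage.text()`; the function names here are hypothetical.

```typescript
// Minimal shape of a captured console message.
interface ConsoleEntry {
  type: string; // e.g. "log", "info", "warning", "error"
  text: string;
}

// Keep only warnings and errors, one formatted line per message.
function filterConsole(entries: readonly ConsoleEntry[]): string[] {
  return entries
    .filter((e) => e.type === "warning" || e.type === "error")
    .map((e) => `[${e.type}] ${e.text}`);
}

// Empty output is the positive signal; anything else is a P1 finding.
function consoleFinding(lines: readonly string[]): "clean" | "P1" {
  return lines.length === 0 ? "clean" : "P1";
}
```

Uncaught page exceptions reach Playwright through the separate `pageerror` event, so a real collector would probably need to listen for both.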
summary.json
Structured outcome:
{
  "feature": "lab-promotion-banner",
  "spec": "wiki/specs/lab-promotion-banner.md",
  "scenarios": [
    {
      "name": "Pinned trial is shown above the leaderboard",
      "status": "passed",
      "durationMs": 2143
    },
    {
      "name": "Promote button is disabled for in-progress trials",
      "status": "passed",
      "durationMs": 1876
    }
  ],
  "ci": {
    "type-check": "passed",
    "lint": "passed",
    "vitest": { "passed": 12, "failed": 0, "skipped": 0 },
    "playwright": { "passed": 2, "failed": 0, "skipped": 0 }
  },
  "tokenCost": {
    "architect": { "tokens": 12450, "usd": 0.45 },
    "test-writer": { "tokens": 8200, "usd": 0.07 },
    "editor": { "tokens": 38900, "usd": 0.32 },
    "verifier": { "tokens": 5100, "usd": 0.04 }
  },
  "elapsedSeconds": 1247,
  "throughput": { "featuresPerSecondPerToken": 1.2e-9 }
}
Consumed by:
- The merge gate (ci.* must all be passed).
- The throughput logger (writes a row to .compound-state/agent-service.db).
- The optimization mathematician (uses cost / throughput / failure data to rotate model routing).
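The merge-gate rule can be sketched as a small predicate over the `ci` object. The example mixes plain status strings with count objects, so the sketch assumes a count object passes when `failed` is 0; that interpretation and the function names are assumptions, not the gate's documented behaviour.

```typescript
// A ci entry is either a plain status string or a counts object,
// matching the two shapes in the example summary.json.
type CiCounts = { passed: number; failed: number; skipped: number };
type CiResult = string | CiCounts;

// Assumption: a counts object passes when nothing failed.
function ciEntryPassed(result: CiResult): boolean {
  if (typeof result === "string") return result === "passed";
  return result.failed === 0;
}

// The gate opens only if there is at least one check and every check passed.
function mergeGateOpen(ci: Record<string, CiResult>): boolean {
  const results = Object.values(ci);
  return results.length > 0 && results.every(ciEntryPassed);
}

// The ci block from the example summary.
const exampleCi: Record<string, CiResult> = {
  "type-check": "passed",
  lint: "passed",
  vitest: { passed: 12, failed: 0, skipped: 0 },
  playwright: { passed: 2, failed: 0, skipped: 0 },
};
```

An empty `ci` object is treated as a closed gate, so a summary that records no checks cannot merge by accident.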
README.md
One paragraph for humans. Links to the spec, lists the scenarios, embeds the screenshot. This is what gets posted to Telegram via interface-agent when the PR is ready.
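As an illustration only, the README could be rendered from the same fields that `summary.json` carries. The renderer below is hypothetical: the real wording is not specified on this page, and the relative link assumes the spec path is repo-root-relative, three levels above the dossier folder.

```typescript
interface ScenarioResult {
  name: string;
  status: string;
  durationMs: number;
}

// Hypothetical renderer for docs/dossiers/<feature>/README.md.
function renderReadme(
  feature: string,
  specPath: string,
  scenarios: readonly ScenarioResult[],
): string {
  const passed = scenarios.filter((s) => s.status === "passed").length;
  const lines = [
    `Dossier for ${feature}: ${passed}/${scenarios.length} scenarios passed.`,
    // Assumes specPath is relative to the repository root.
    `Spec: [${specPath}](../../../${specPath})`,
    "",
    ...scenarios.map((s) => `- ${s.name}: ${s.status} (${s.durationMs} ms)`),
    "",
    "![Final rendered state](screenshot.png)",
  ];
  return lines.join("\n");
}
```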
What the dossier is not
- Not a substitute for the spec. The spec defines what the dossier must prove.
- Not a substitute for code review. Reviewer subagents read the diff; the dossier verifies behaviour.
- Not a snapshot test. Snapshot tests freeze incidental rendering; the dossier captures intentional behaviour against a typed contract.
- Not optional. A PR without a dossier is a PR without proof — the orchestrator refuses to queue it for auto-merge.
Mocked vs real-environment evidence
Some features, especially optimization and Timefold workflows, cannot run full real solves in every CI pass. The dossier must separate evidence types explicitly:
- Deterministic evidence: unit, integration, and browser tests with mocked external systems. These are required in CI.
- Real-environment smoke evidence: short bounded runs against real services when credentials and latency budget are available. These are recorded in summary.json as optional smoke checks, not as a substitute for deterministic tests.
For solver-backed features, the spec must state which acceptance scenarios are proven by deterministic mocks and which are proven by short real-solve smoke commands.
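One way to keep the two evidence types from blurring is to tag each result and judge them separately. This page does not define how evidence types are encoded in `summary.json`, so every field and function name in this sketch is hypothetical.

```typescript
// Hypothetical encoding of the two evidence types.
type EvidenceKind = "deterministic" | "real-smoke";

interface EvidenceResult {
  scenario: string;
  kind: EvidenceKind;
  status: "passed" | "failed" | "skipped";
}

interface EvidenceVerdict {
  deterministicOk: boolean; // required: this is what gates the merge
  smokeFailures: string[]; // optional: reported, never a substitute
}

function judgeEvidence(results: readonly EvidenceResult[]): EvidenceVerdict {
  const deterministic = results.filter((r) => r.kind === "deterministic");
  const smoke = results.filter((r) => r.kind === "real-smoke");
  return {
    // Smoke passes cannot compensate for missing or failing deterministic runs.
    deterministicOk:
      deterministic.length > 0 && deterministic.every((r) => r.status === "passed"),
    smokeFailures: smoke.filter((r) => r.status === "failed").map((r) => r.scenario),
  };
}
```

A skipped smoke run (no credentials, no latency budget) leaves `smokeFailures` empty, while a dossier with only smoke results still reports `deterministicOk: false`.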
Failure cases the dossier catches
| Failure | Caught by |
|---|---|
| Component renders but a downstream hook errors silently | console.log non-empty |
| Test passes because a mock returns the expected shape; production doesn't | Playwright runs against real backend in CI; trace shows the actual network calls |
| Acceptance scenario passes structurally but visually broken | Screenshot inspected by human |
| One scenario passes; another silently skipped | summary.json shows skipped count |
| CI runs old code (cache poisoning) | summary.json.elapsedSeconds and trace timestamps cross-checked |
| Token spend exploded on this feature | summary.json.tokenCost flagged in scale-or-kill review |
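Two rows of the table (silently skipped scenarios, token spend) can be checked mechanically from `summary.json`. The sketch below assumes a USD budget is supplied by the caller; this page states no threshold, so `usdBudget` and the function name are hypothetical.

```typescript
// The parts of summary.json this check reads.
interface SummarySlice {
  ci: Record<string, string | { passed: number; failed: number; skipped: number }>;
  tokenCost: Record<string, { tokens: number; usd: number }>;
}

// Returns human-readable findings; an empty array means nothing to flag.
function summaryFindings(summary: SummarySlice, usdBudget: number): string[] {
  const findings: string[] = [];
  // Any suite reporting skipped tests is surfaced, even if nothing failed.
  for (const [suite, result] of Object.entries(summary.ci)) {
    if (typeof result !== "string" && result.skipped > 0) {
      findings.push(`${suite}: ${result.skipped} skipped`);
    }
  }
  // Sum the per-agent cost and compare it against the supplied budget.
  const totalUsd = Object.values(summary.tokenCost).reduce((sum, c) => sum + c.usd, 0);
  if (totalUsd > usdBudget) {
    findings.push(`token cost $${totalUsd.toFixed(2)} exceeds budget $${usdBudget.toFixed(2)}`);
  }
  return findings;
}
```

For the example summary the per-agent costs total $0.88, so a budget of 1.00 produces no findings and a budget of 0.50 produces one.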
Cross-references
- PRD-to-PR pipeline — stages 5–6 produce the dossier.
- Spec as contract — what the dossier verifies against.
- Reviewer feedback loop — how a non-empty console.log becomes a P1 finding.
- Throughput and business signals: summary.json.tokenCost feeds the throughput logger.
- Playwright e2e catalog: the existing tests this pattern composes on top of.