Verification and evidence

The pipeline's contract: no dossier, no merge. A green CI run is necessary but not sufficient. Tests can pass while the actual user-facing behaviour is broken (wrong selectors, mocked-out integration, console errors swallowed). The dossier is what proves the feature actually rendered, in a real browser, in a state matching the spec.

This page defines the dossier format and what each element proves.

The `docs/dossiers/<feature>/` directory

Target state: every feature branch ships with docs/dossiers/<feature>/ populated by the Verifier agent before the PR can merge. Today these folders are hand-crafted; the bundler that automates the layout below is a gap-list item:

docs/dossiers/<feature>/
├── trace.zip          ← Playwright trace (network, DOM snapshots, screenshots, console)
├── screenshot.png     ← Final rendered state for the primary acceptance scenario
├── console.log        ← Browser console output (filtered to warnings + errors)
├── summary.json       ← Machine-readable: which scenarios passed, timings, costs
└── README.md          ← One-paragraph human summary linking to the spec scenarios
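
A minimal sketch of how a pre-merge check might confirm that the layout above is complete. The helper name `missingArtefacts` and the choice to pass in a plain list of file names are assumptions for illustration; this page does not specify how the bundler or the Verifier agent performs the check.

```typescript
// The five artefacts named in the dossier layout.
const REQUIRED_ARTEFACTS = [
  "trace.zip",
  "screenshot.png",
  "console.log",
  "summary.json",
  "README.md",
] as const;

// Hypothetical helper: given the file names found in docs/dossiers/<feature>/,
// return the required artefacts that are absent (empty array means complete).
function missingArtefacts(present: readonly string[]): string[] {
  const found = new Set(present);
  return REQUIRED_ARTEFACTS.filter((name) => !found.has(name));
}
```

In a Node script the `present` list could come from `fs.readdirSync` on the dossier directory; a non-empty result would mean the dossier is incomplete.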

What each artefact proves

`trace.zip`

Playwright's native trace bundle. Captures:

Every network request (URL, method, status, response body).
DOM snapshot at each interaction.
Console messages in real time.
Optional screenshot filmstrip of the run (Playwright records video as a separate file, not inside the trace).

This is the artefact a human inspects when they don't trust the screenshot. It can be replayed in Playwright Trace Viewer to step through the test as it happened. Mostly an audit tool — agents don't read it; humans do, when something looks off.

`screenshot.png`

The single image attached to the human-channel notification when the PR is ready. Captured at the end of the primary acceptance scenario named in the spec. Resolution: 1440×900 by default, dark theme, signed-in test user.

This is the proof the human signs off on. The standing rule: a Playwright run that doesn't produce a screenshot showing the implemented feature actually rendering is not verification — it's a no-error smoke test (memory rule feedback_playwright_screenshot_proof.md).
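
The capture defaults described in this section (1440×900, dark theme, signed-in test user) map onto standard Playwright `use` options. The sketch below shows one way they might be expressed; the constant name `dossierUse` and the session-file path are hypothetical, and the real configuration in this repository may differ.

```typescript
// Sketch of the `use` block of a playwright.config.ts. The option names are
// real Playwright options; the values mirror the defaults described above.
const dossierUse = {
  trace: "on", // record a trace for every run
  screenshot: "on", // capture a screenshot when each test finishes
  viewport: { width: 1440, height: 900 }, // default dossier resolution
  colorScheme: "dark", // dark theme
  // Assumption: the signed-in test user is restored from a saved session file.
  storageState: "playwright/.auth/test-user.json", // hypothetical path
} as const;

// In a real config file this object would be passed along as:
//   export default defineConfig({ use: dossierUse });
```

With `screenshot: "on"` Playwright captures one image per test, so the primary acceptance scenario would still need to be the test whose image is copied to `screenshot.png`.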

`console.log`

Filtered to warning and error levels. Catches the class of regression where a component "renders" but emits a stack trace into the console — common when a generated GraphQL hook silently fails or a hook ordering bug doesn't crash but does corrupt state.

Empty console.log is a positive signal. Non-empty is a P1 finding for the reviewer.
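
The filtering rule can be sketched as a pure function over captured console entries. The `ConsoleEntry` shape, the function names, and the one-line-per-entry output format are assumptions; only the warning and error filter comes from this page.

```typescript
// Minimal shape of a captured console entry. In a Playwright run these would
// come from page.on("console", (msg) => ...) using msg.type() and msg.text().
interface ConsoleEntry {
  type: string; // Playwright reports e.g. "log", "info", "warning", "error"
  text: string;
}

// Keep only the levels the dossier records.
function filterForDossier(entries: ConsoleEntry[]): ConsoleEntry[] {
  return entries.filter((e) => e.type === "warning" || e.type === "error");
}

// Hypothetical serialisation: one "[level] message" line per entry.
// An empty string corresponds to the empty-file positive signal.
function toConsoleLog(entries: ConsoleEntry[]): string {
  return filterForDossier(entries)
    .map((e) => `[${e.type}] ${e.text}`)
    .join("\n");
}
```

Uncaught page exceptions surface through Playwright's separate `pageerror` event, so a real collector would likely listen to both events.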

`summary.json`

Structured outcome:

{
  "feature": "lab-promotion-banner",
  "spec": "wiki/specs/lab-promotion-banner.md",
  "scenarios": [
    {
      "name": "Pinned trial is shown above the leaderboard",
      "status": "passed",
      "durationMs": 2143
    },
    {
      "name": "Promote button is disabled for in-progress trials",
      "status": "passed",
      "durationMs": 1876
    }
  ],
  "ci": {
    "type-check": "passed",
    "lint": "passed",
    "vitest": { "passed": 12, "failed": 0, "skipped": 0 },
    "playwright": { "passed": 2, "failed": 0, "skipped": 0 }
  },
  "tokenCost": {
    "architect": { "tokens": 12450, "usd": 0.45 },
    "test-writer": { "tokens": 8200, "usd": 0.07 },
    "editor": { "tokens": 38900, "usd": 0.32 },
    "verifier": { "tokens": 5100, "usd": 0.04 }
  },
  "elapsedSeconds": 1247,
  "throughput": { "featuresPerSecondPerToken": 1.2e-8 }
}

Consumed by:

The merge gate (every ci.* check must pass: status checks must read passed, and count checks must report zero failed).
The throughput logger (writes a row to .compound-state/agent-service.db).
The optimization mathematician (uses cost / throughput / failure data to rotate model routing).
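
A minimal sketch of the merge-gate condition, assuming the `ci` block has the two shapes seen in the example (a status string, or an object of counts). The type and function names are hypothetical, and the real gate may apply further rules.

```typescript
// Shapes taken from the ci block of the example summary.json.
interface Counts {
  passed: number;
  failed: number;
  skipped: number;
}
type CiBlock = Record<string, string | Counts>;

// Hypothetical gate: a status check must equal "passed";
// a count check must report zero failures.
function ciAllPassed(ci: CiBlock): boolean {
  return Object.values(ci).every((check) =>
    typeof check === "string" ? check === "passed" : check.failed === 0
  );
}
```

Applied to the example above, `ciAllPassed(summary.ci)` returns `true`; a single failed vitest or Playwright test, or a `lint` value other than `"passed"`, flips it to `false`.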

`README.md`

One paragraph for humans. Links to the spec, lists the scenarios, embeds the screenshot. This is what gets posted to Telegram via interface-agent when the PR is ready.

What the dossier is not

Not a substitute for the spec. The spec defines what the dossier must prove.
Not a substitute for code review. Reviewer subagents read the diff; the dossier verifies behaviour.
Not a snapshot test. Snapshot tests freeze incidental rendering; the dossier captures intentional behaviour against a typed contract.
Not optional. A PR without a dossier is a PR without proof — the orchestrator refuses to queue it for auto-merge.

Mocked vs real-environment evidence

Some features, especially optimization and Timefold workflows, cannot run full real solves in every CI pass. The dossier must separate evidence types explicitly:

Deterministic evidence: unit, integration, and browser tests with mocked external systems. These are required in CI.
Real-environment smoke evidence: short bounded runs against real services when credentials and latency budget are available. These are recorded in summary.json as optional smoke checks, not as a substitute for deterministic tests.

For solver-backed features, the spec must state which acceptance scenarios are proven by deterministic mocks and which are proven by short real-solve smoke commands.
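
One way to keep the two evidence types separate is to tag every result with its kind and apply a different rule to each. The sketch below is an assumed policy, not one stated on this page: deterministic evidence blocks unless it passed, while smoke evidence may be skipped (for example, when credentials are unavailable) but blocks on an actual failure. All names are hypothetical.

```typescript
type EvidenceKind = "deterministic" | "smoke";

interface EvidenceResult {
  scenario: string;
  kind: EvidenceKind;
  status: "passed" | "failed" | "skipped";
}

// Assumed policy: deterministic evidence is required, so anything other than
// "passed" blocks; smoke evidence is optional, so "skipped" is tolerated,
// but an actual failure still blocks.
function blockingResults(results: EvidenceResult[]): EvidenceResult[] {
  return results.filter((r) =>
    r.kind === "deterministic" ? r.status !== "passed" : r.status === "failed"
  );
}
```

An empty return value means nothing blocks; otherwise the returned entries name the scenarios a reviewer needs to look at.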

Failure cases the dossier catches

Failure	Caught by
Component renders but a downstream hook errors silently	`console.log` non-empty
Test passes because a mock returns the expected shape; production doesn't	Real-environment smoke run against real services; trace shows the actual network calls
Acceptance scenario passes structurally but is visually broken	Screenshot inspected by human
One scenario passes; another silently skipped	`summary.json` shows skipped count
CI runs old code (cache poisoning)	`summary.json.elapsedSeconds` and trace timestamps cross-checked
Token spend exploded on this feature	`summary.json.tokenCost` flagged in scale-or-kill review
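
Two of the checks in the table, the skipped count and the token spend, reduce to simple aggregations over `summary.json`. The sketch below assumes the field shapes from the example; the function names are hypothetical, and no budget threshold is implied because this page does not define one.

```typescript
interface TestCounts {
  passed: number;
  failed: number;
  skipped: number;
}
interface AgentCost {
  tokens: number;
  usd: number;
}

// Total skipped tests across all count-style ci checks, so that a
// silently skipped scenario is surfaced rather than ignored.
function totalSkipped(ci: Record<string, string | TestCounts>): number {
  return Object.values(ci).reduce(
    (sum, check) => sum + (typeof check === "string" ? 0 : check.skipped),
    0
  );
}

// Total spend across agents, for use in a scale-or-kill review.
function totalCost(tokenCost: Record<string, AgentCost>): AgentCost {
  return Object.values(tokenCost).reduce(
    (acc, c) => ({ tokens: acc.tokens + c.tokens, usd: acc.usd + c.usd }),
    { tokens: 0, usd: 0 }
  );
}
```

For the example `summary.json` on this page, `totalSkipped` is 0 and `totalCost` comes to 64650 tokens and roughly $0.88.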

Cross-references

PRD-to-PR pipeline — stages 5–6 produce the dossier.
Spec as contract — what the dossier verifies against.
Reviewer feedback loop — how a non-empty console.log becomes a P1 finding.
Throughput and business signals — summary.json.tokenCost feeds the throughput logger.
Playwright e2e catalog — the existing tests this pattern composes on top of.
