Agentic Workflow

Verification and evidence

The dossier is the proof. A screenshot bundle attached to every PR — the Playwright Agents 1.56 pattern.

The pipeline's contract: no dossier, no merge. A green CI run is necessary but not sufficient. Tests can pass while the actual user-facing behaviour is broken (wrong selectors, mocked-out integration, console errors swallowed). The dossier is what proves the feature actually rendered, in a real browser, in a state matching the spec.

This page defines the dossier format and what each element proves.

The docs/dossiers/<feature>/ directory

In the target state, every feature branch ships with docs/dossiers/<feature>/, populated by the Verifier agent before the PR can merge. Today these folders are assembled by hand; the bundler that automates the layout below is still a gap-list item:

docs/dossiers/<feature>/
├── trace.zip          ← Playwright trace (network, DOM snapshots, screencast, console)
├── screenshot.png     ← Final rendered state for the primary acceptance scenario
├── console.log        ← Browser console output (filtered to warnings + errors)
├── summary.json       ← Machine-readable: which scenarios passed, timings, costs
└── README.md          ← One-paragraph human summary linking to the spec scenarios
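The "no dossier, no merge" contract can be checked mechanically against this layout. The following is a minimal sketch, not the planned bundler: the function names are hypothetical, and it assumes the caller supplies a flat listing of the dossier directory (for example from `fs.readdirSync`).

```typescript
// Artefacts the layout above requires in docs/dossiers/<feature>/.
const REQUIRED_ARTEFACTS = [
  "trace.zip",
  "screenshot.png",
  "console.log",
  "summary.json",
  "README.md",
] as const;

// Returns the required artefacts that are absent from a directory listing.
// `present` is a flat list of file names (hypothetical input shape).
function missingArtefacts(present: readonly string[]): string[] {
  const seen = new Set<string>(present);
  return REQUIRED_ARTEFACTS.filter((name) => !seen.has(name));
}

// "No dossier, no merge": the gate passes only when nothing is missing.
function dossierIsComplete(present: readonly string[]): boolean {
  return missingArtefacts(present).length === 0;
}
```

A CI step could fail the merge check whenever `missingArtefacts` returns a non-empty list and print the missing names for the author.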

What each artefact proves

trace.zip

Playwright's native trace bundle. Per the layout above, it captures network activity, DOM snapshots, and console output.

This is the artefact a human inspects when they don't trust the screenshot. It can be replayed in Playwright Trace Viewer to step through the test as it happened. Mostly an audit tool — agents don't read it; humans do, when something looks off.

screenshot.png

The single image attached to the human-channel notification when the PR is ready. Captured at the end of the primary acceptance scenario named in the spec. Resolution: 1440×900 by default, dark theme, signed-in test user.

This is the proof the human signs off on. The standing rule: a Playwright run that doesn't produce a screenshot showing the implemented feature actually rendering is not verification — it's a no-error smoke test (memory rule feedback_playwright_screenshot_proof.md).
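The capture settings described for trace.zip and screenshot.png map onto standard Playwright `use` options. The sketch below keeps them in a plain object so it stands alone; the `storageState` path is hypothetical, and it assumes the application's dark theme follows the browser's `prefers-color-scheme`, which `colorScheme` emulates. A recorded trace can then be replayed with `npx playwright show-trace trace.zip`.

```typescript
// Capture settings for the primary acceptance scenario.
// The storageState path is hypothetical; point it at wherever the
// signed-in test user's session is saved.
const dossierUse = {
  trace: "on",                              // record a trace for every run
  viewport: { width: 1440, height: 900 },   // default dossier resolution
  colorScheme: "dark",                      // emulate prefers-color-scheme: dark
  storageState: "playwright/.auth/test-user.json",
} as const;

// Intended use in playwright.config.ts:
//   import { defineConfig } from "@playwright/test";
//   export default defineConfig({ use: dossierUse });
//
// The final image would be taken at the end of the scenario with:
//   await page.screenshot({ path: "docs/dossiers/<feature>/screenshot.png" });
```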

console.log

Filtered to warning and error levels. Catches the class of regression where a component "renders" but emits a stack trace into the console — common when a generated GraphQL hook silently fails or a hook ordering bug doesn't crash but does corrupt state.

An empty console.log is a positive signal; a non-empty one is a P1 finding for the reviewer.
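The level filter can be sketched as a pure function over captured messages. This is an illustrative sketch rather than the Verifier's actual code: `ConsoleEntry` and `toConsoleLog` are hypothetical names, while the `"warning"` and `"error"` strings are the values Playwright's `ConsoleMessage.type()` reports for those levels.

```typescript
// Minimal shape of a captured console message. In Playwright these values
// would come from ConsoleMessage.type() and ConsoleMessage.text().
interface ConsoleEntry {
  type: string;
  text: string;
}

// Levels the dossier keeps; everything else (log, info, debug, ...) is dropped.
const KEPT_LEVELS = new Set<string>(["warning", "error"]);

// Renders the kept messages one per line; returns "" when nothing qualifies,
// which is the "empty console.log" positive signal.
function toConsoleLog(entries: readonly ConsoleEntry[]): string {
  return entries
    .filter((entry) => KEPT_LEVELS.has(entry.type))
    .map((entry) => `[${entry.type}] ${entry.text}`)
    .join("\n");
}
```

In a test, the entries could be collected with `page.on("console", (msg) => entries.push({ type: msg.type(), text: msg.text() }))`. Uncaught exceptions arrive on the separate `pageerror` event, so a fuller capture would likely listen to both.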

summary.json

Structured outcome:

{
  "feature": "lab-promotion-banner",
  "spec": "wiki/specs/lab-promotion-banner.md",
  "scenarios": [
    {
      "name": "Pinned trial is shown above the leaderboard",
      "status": "passed",
      "durationMs": 2143
    },
    {
      "name": "Promote button is disabled for in-progress trials",
      "status": "passed",
      "durationMs": 1876
    }
  ],
  "ci": {
    "type-check": "passed",
    "lint": "passed",
    "vitest": { "passed": 12, "failed": 0, "skipped": 0 },
    "playwright": { "passed": 2, "failed": 0, "skipped": 0 }
  },
  "tokenCost": {
    "architect": { "tokens": 12450, "usd": 0.45 },
    "test-writer": { "tokens": 8200, "usd": 0.07 },
    "editor": { "tokens": 38900, "usd": 0.32 },
    "verifier": { "tokens": 5100, "usd": 0.04 }
  },
  "elapsedSeconds": 1247,
  "throughput": { "featuresPerSecondPerToken": 1.2e-8 }
}
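For consumers that read this file, the shape above can be written down as types. The sketch below is inferred from the example only and is not a published schema; `totalCost` is a hypothetical helper that sums the per-agent figures, working in cents to avoid floating-point drift.

```typescript
interface Counts { passed: number; failed: number; skipped: number; }
interface Scenario { name: string; status: string; durationMs: number; }
interface AgentCost { tokens: number; usd: number; }

// Shape inferred from the example summary.json; field names are copied from it.
interface DossierSummary {
  feature: string;
  spec: string;
  scenarios: Scenario[];
  ci: { "type-check": string; lint: string; vitest: Counts; playwright: Counts };
  tokenCost: Record<string, AgentCost>;
  elapsedSeconds: number;
  throughput: { featuresPerSecondPerToken: number };
}

// Sums tokens and dollars across all agents listed under tokenCost.
function totalCost(summary: Pick<DossierSummary, "tokenCost">): AgentCost {
  let tokens = 0;
  let cents = 0;
  for (const cost of Object.values(summary.tokenCost)) {
    tokens += cost.tokens;
    cents += Math.round(cost.usd * 100); // whole cents, so sums stay exact
  }
  return { tokens, usd: cents / 100 };
}
```

Applied to the example above, the four agents total 64,650 tokens and $0.88.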

Consumed by:

README.md

One paragraph for humans. Links to the spec, lists the scenarios, embeds the screenshot. This is what gets posted to Telegram via interface-agent when the PR is ready.

What the dossier is not

Mocked vs real-environment evidence

Some features, especially optimization and Timefold workflows, cannot run full real solves in every CI pass. The dossier must separate evidence types explicitly:

For solver-backed features, the spec must state which acceptance scenarios are proven by deterministic mocks and which are proven by short real-solve smoke commands.
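One way to make that separation checkable is to tag each acceptance scenario with its evidence type. The sketch below is illustrative only: the label strings, the `evidence` field, and both function names are assumptions, since the source requires the distinction but does not define a format for it.

```typescript
// Hypothetical evidence labels; the source only requires that the two
// kinds be kept apart, not these exact names.
type EvidenceKind = "deterministic-mock" | "real-solve-smoke";

interface ScenarioEvidence {
  scenario: string;
  evidence?: EvidenceKind; // undefined means the spec has not declared it
}

// Returns the scenarios whose evidence type has not been declared.
function undeclaredEvidence(items: readonly ScenarioEvidence[]): string[] {
  return items
    .filter((item) => item.evidence === undefined)
    .map((item) => item.scenario);
}

// Groups scenario names by evidence kind so the dossier can list them separately.
function groupByEvidence(
  items: readonly ScenarioEvidence[],
): Record<EvidenceKind, string[]> {
  const groups: Record<EvidenceKind, string[]> = {
    "deterministic-mock": [],
    "real-solve-smoke": [],
  };
  for (const item of items) {
    if (item.evidence !== undefined) groups[item.evidence].push(item.scenario);
  }
  return groups;
}
```

A spec lint could reject any solver-backed feature for which `undeclaredEvidence` is non-empty.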

Failure cases the dossier catches

| Failure | Caught by |
| --- | --- |
| Component renders but a downstream hook errors silently | console.log non-empty |
| Test passes because a mock returns the expected shape; production doesn't | Playwright runs against real backend in CI; trace shows the actual network calls |
| Acceptance scenario passes structurally but visually broken | Screenshot inspected by human |
| One scenario passes; another silently skipped | summary.json shows skipped count |
| CI runs old code (cache poisoning) | summary.json.elapsedSeconds and trace timestamps cross-checked |
| Token spend exploded on this feature | summary.json.tokenCost flagged in scale-or-kill review |
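The failure cases that need no human judgment (non-empty console output, skipped or failed scenarios) could be folded into one automated pre-review step. This is a sketch under assumptions: `GateInput` and `reviewFindings` are hypothetical names, and the inputs are assumed to be already read from console.log and from the `ci.playwright` counts in summary.json.

```typescript
interface PlaywrightCounts { passed: number; failed: number; skipped: number; }

interface GateInput {
  playwright: PlaywrightCounts; // summary.json -> ci.playwright
  consoleLog: string;           // contents of console.log
}

// Returns human-readable findings; an empty array means no automated flags.
function reviewFindings(input: GateInput): string[] {
  const findings: string[] = [];
  if (input.consoleLog.trim().length > 0) {
    findings.push("P1: console.log is non-empty");
  }
  if (input.playwright.skipped > 0) {
    findings.push(`${input.playwright.skipped} scenario(s) skipped`);
  }
  if (input.playwright.failed > 0) {
    findings.push(`${input.playwright.failed} scenario(s) failed`);
  }
  return findings;
}
```

The screenshot, trace cross-check, and token-cost rows still depend on a reviewer, so this complements the human sign-off rather than replacing it.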

Cross-references