Agentic Workflow

Agentic feature-implementation workflow

How CAIRE ships software without scaling humans. Nine stages from PRD to merged PR.

This section is the wiki's instantiation of the vision and mandate: scale CAIRE output 1000× without scaling humans, by making agent execution the default path for shipping features. Humans write PRDs and approve evidence; agents do everything in between.

Human interfaces

Three tiers, each calling the same backend pipeline. L0 and L1 work today; L2 is the next outstanding item.

L0 — Cursor / Claude Code in a worktree

Use this when you want hands-on control of the diff before the runner takes over.

  1. Write wiki/plans/<feature>-YYYY-MM-DD.md with PRD frontmatter and a status callout (see SCHEMA.md).
  2. Create a worktree: ./scripts/git/worktree-add.sh <slug> {feat|fix|chore|docs}/<domain>/<slug>.
  3. Open the PRD in Cursor Composer or invoke Claude Code in the worktree. The agent reads the PRD, writes tests, implements, runs the three review subagents, and opens a PR.
  4. Verify before merge by reviewing the PR diff and the dossier folder under docs/dossiers/<feature>/ on GitHub. Approval = a normal GitHub PR review.
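The naming conventions in steps 1 and 2 can be checked mechanically before an agent starts. The sketch below is illustrative only: parsePrdPath and worktreeCommand are hypothetical helpers, not part of the repo, and the kebab-case rule and the optional subfolder (such as archive/) are assumptions.

```typescript
// Branch prefixes listed in the worktree step.
const BRANCH_TYPES = ["feat", "fix", "chore", "docs"] as const;
type BranchType = (typeof BRANCH_TYPES)[number];

interface PrdPath {
  feature: string; // e.g. "banner-hello"
  date: string; // YYYY-MM-DD
}

// Assumption: lowercase kebab-case names, optional subfolders under wiki/plans/.
const PRD_PATH =
  /^wiki\/plans\/(?:[a-z0-9-]+\/)*([a-z0-9][a-z0-9-]*)-(\d{4}-\d{2}-\d{2})\.md$/;

// Returns null when the path does not follow wiki/plans/<feature>-YYYY-MM-DD.md.
function parsePrdPath(path: string): PrdPath | null {
  const match = PRD_PATH.exec(path);
  if (match === null) return null;
  const [, feature, date] = match;
  if (feature === undefined || date === undefined) return null;
  return { feature, date };
}

// Builds the worktree command from step 2; the domain value is supplied by the caller.
function worktreeCommand(slug: string, type: BranchType, domain: string): string {
  return `./scripts/git/worktree-add.sh ${slug} ${type}/${domain}/${slug}`;
}
```

A path that fails to parse is a cheap signal that the PRD file was misnamed before any agent time is spent.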

L1 — Darwin web at localhost:3010/prd-to-pr (live)

The hands-free path. The dashboard sidebar entry "PRD-to-PR" lists every run; the form at the top kicks one off.

  1. Write wiki/plans/<feature>-YYYY-MM-DD.md and commit to main of beta-appcaire.
  2. On the dashboard, paste the PRD path (e.g. wiki/plans/archive/banner-hello-2026-04-30.md) and click Start run.
  3. The runner walks all 9 stages. The detail page shows live elapsed time, per-stage timing, the agent's stdout log per stage, and an artifact viewer (PRD body, architect's spec, test files, editor's diff, dossier).
  4. After stage 6 verifies the dossier, the run lands at awaiting_approval. Click Approve to queue gh pr merge --auto --squash; click Reject to record a reason in the row's notes.

The runner enforces an auto-kill switch: MAX_CONCURRENT_PRD_RUNS=10 and MAX_RUNS_PER_DAY=30 (override in ~/.config/caire/env or the process env). Submitting past the cap returns HTTP 429.
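A minimal sketch of how such a cap check might look, assuming the runner counts active runs and runs started today before accepting a submission. The function and type names are hypothetical; only the two variable names, their defaults, and the 429 response come from the text above.

```typescript
interface RunCaps {
  maxConcurrent: number;
  maxPerDay: number;
}

// Defaults stated in this page.
const DEFAULT_CAPS: RunCaps = { maxConcurrent: 10, maxPerDay: 30 };

// Reads overrides from an env-like record; invalid or missing values fall back to defaults.
function capsFromEnv(env: Record<string, string | undefined>): RunCaps {
  const parse = (raw: string | undefined, fallback: number): number => {
    const n = Number.parseInt(raw ?? "", 10);
    return Number.isInteger(n) && n > 0 ? n : fallback;
  };
  return {
    maxConcurrent: parse(env["MAX_CONCURRENT_PRD_RUNS"], DEFAULT_CAPS.maxConcurrent),
    maxPerDay: parse(env["MAX_RUNS_PER_DAY"], DEFAULT_CAPS.maxPerDay),
  };
}

type Admission =
  | { admitted: true }
  | { admitted: false; status: 429; reason: string };

// Hypothetical admission check: reject with 429 once either cap is reached.
function admitRun(activeRuns: number, runsToday: number, caps: RunCaps): Admission {
  if (activeRuns >= caps.maxConcurrent) {
    return { admitted: false, status: 429, reason: "concurrent run cap reached" };
  }
  if (runsToday >= caps.maxPerDay) {
    return { admitted: false, status: 429, reason: "daily run cap reached" };
  }
  return { admitted: true };
}
```

Keeping the check a pure function of two counters makes it easy to unit-test without a running server.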

L2 — Telegram via interface-agent (outstanding)

Described in darwin-as-orchestrator.md. The interface-agent routes PRD-style messages to the same POST /api/prd-to-pr endpoint that L1 uses, posts the dossier screenshot back to Telegram, and merges when the user replies "approve". It is a thin shell over the L1 backend: the routing rule and the Telegram handler are the only new code needed.
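Since L2 is not built yet, the following is only a sketch of what the routing rule could look like. Everything except the POST /api/prd-to-pr endpoint and the "approve" reply is an assumption: the routeMessage name, the prdPath body field, and the heuristic that a PRD-style message is one that contains a wiki/plans/ path.

```typescript
type Route =
  | {
      kind: "start_run";
      request: { method: "POST"; path: "/api/prd-to-pr"; body: { prdPath: string } };
    }
  | { kind: "approve" }
  | { kind: "ignore" };

// Assumption: a PRD-style message mentions a Markdown file under wiki/plans/.
const PRD_PATH_IN_TEXT = /wiki\/plans\/[\w./-]+\.md/;

// Pure routing rule: decides what the Telegram handler should do with a message.
function routeMessage(text: string): Route {
  const trimmed = text.trim();
  if (trimmed.toLowerCase() === "approve") return { kind: "approve" };

  const found = PRD_PATH_IN_TEXT.exec(trimmed)?.[0];
  if (found !== undefined) {
    return {
      kind: "start_run",
      // The body shape is hypothetical; the real endpoint contract may differ.
      request: { method: "POST", path: "/api/prd-to-pr", body: { prdPath: found } },
    };
  }
  return { kind: "ignore" };
}
```

Returning a request description instead of performing the HTTP call keeps the rule testable and leaves transport details to the handler.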

The nine stages (as built)

The runner's PIPELINE constant in apps/server/src/pipeline/runner.ts walks these in order. Stage names match the dashboard's current_stage column.

0. intake         (worktree creation, env replication, yarn install + generate + per-server db:generate)
   ↓
1. prd            (PRD frontmatter + status callout validated)
   ↓
2. spec           (Architect agent — Opus — emits wiki/specs/<feature>.md Gherkin)
   ↓
3. tests          (Test-writer agent — Sonnet — emits failing vitest + Playwright; runner WIP-commits them)
   ↓
4. implement      (Editor agent — Sonnet — iterates with transactional git stash; MAX_ITERATIONS=8)
   ↓
5. review         (resolver-reviewer / dashboard-reviewer / perf-reviewer subagents in parallel)
   ↓ (P1 finding → re-enter implement; cycle bumps; capped MAX_REVIEW_CYCLES=3)
6. verify         (yarn type-check + lint + Playwright; dossier bundler writes docs/dossiers/<feature>/)
   ↓ (Playwright failure → re-enter implement with verifyHint; same cycle cap)
7. reviewer_loop  (gh push; poll Codex/CodeRabbit comments; re-enter implement on P1/P2 or argue down)
   ↓
8. merge          (gh pr merge --auto --squash; awaits human Approve from dashboard)

Re-entry edges (5→4 and 6→4) bump review_cycle. Hitting the cap fails the run with a surfaced reason; the dashboard's stage timing table records every execution so you see "Implement (2× · 4m 50s)" when iteration looped.
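The re-entry behaviour can be sketched as a small loop over the stage list. This is an illustration, not the code in runner.ts: the walk function, the Outcome type, and the reading of the cap as "at most three re-entries, fail on the fourth" are assumptions, and only the 5→4 and 6→4 edges are modelled.

```typescript
const PIPELINE = [
  "intake", "prd", "spec", "tests", "implement",
  "review", "verify", "reviewer_loop", "merge",
] as const;
type Stage = (typeof PIPELINE)[number];

const MAX_REVIEW_CYCLES = 3;

// "ok" advances to the next stage; "reenter" asks to go back to implement.
type Outcome = "ok" | "reenter";

interface RunResult {
  status: "completed" | "failed";
  reason?: string;
  executions: Stage[]; // every stage execution in order, including repeats
  reviewCycle: number;
}

function walk(execute: (stage: Stage, reviewCycle: number) => Outcome): RunResult {
  const executions: Stage[] = [];
  let reviewCycle = 0;
  let i = 0;
  while (i < PIPELINE.length) {
    const stage = PIPELINE[i];
    if (stage === undefined) break;
    executions.push(stage);
    const outcome = execute(stage, reviewCycle);
    // Only review and verify may re-enter in this sketch; other stages always advance.
    if (outcome === "reenter" && (stage === "review" || stage === "verify")) {
      reviewCycle += 1;
      if (reviewCycle > MAX_REVIEW_CYCLES) {
        const reason = `review cycle cap reached at ${stage}`;
        return { status: "failed", reason, executions, reviewCycle };
      }
      i = PIPELINE.indexOf("implement");
      continue;
    }
    i += 1;
  }
  return { status: "completed", executions, reviewCycle };
}

// Number of times a stage ran, e.g. to render a "2×" marker in a timing table.
function countExecutions(executions: Stage[], stage: Stage): number {
  return executions.filter((s) => s === stage).length;
}
```

Recording every execution rather than only the latest one is what makes a label such as "Implement (2× · 4m 50s)" derivable afterwards.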

Pages in this section

  1. vision-and-mandate.md — the north star. The "CTO – AI Systems & Agent Workforce" job description in full. Every other page traces back to one of its four commitments.
  2. prd-to-pr-pipeline.md — what each stage produces, where artefacts live (wiki/plans/<feature>-YYYY-MM-DD.md, wiki/specs/<feature>.md, docs/dossiers/<feature>/).
  3. agent-roles-and-model-routing.md — Architect / Test-writer / Editor / Verifier / Reviewer roles, and which model each one runs.
  4. model-and-vendor-agnosticism.md — vision commitment (b). Routing matrix; rotation cadence; adapter shape.
  5. spec-as-contract.md — Thoughtworks SDD pattern: PRD compiles to spec; tests are generated from spec; spec is the coordination object.
  6. verification-and-evidence.md — failure-dossier pattern (Playwright Agents 1.56). The dossier IS the proof artefact attached to the PR.
  7. reviewer-feedback-loop.md — gh api .../pulls/<n>/comments polling; P1/P2 from Codex as failed tests; re-enter Editor.
  8. scale-or-kill.md — vision commitment (c). Auto-promote what works; auto-kill what regresses. Hands-free.
  9. throughput-and-business-signals.md — vision commitment (d). Features per second per token; revenue/cost/cash as system inputs; mathematician chooses model routing inside the cash budget.
  10. darwin-as-orchestrator.md — the path to replacing the human CTO with agent-CTO per the vision. Telegram → Darwin → 4-stage pipeline → PR with screenshot.
  11. skills/humanizer.md — reusable skill for stripping AI tells from public-facing prose. Mandatory for marketing agents; useful anywhere user-visible copy is generated.

What this section is NOT

Cursor “Build plan” (Composer plan) — editor workflow

When the user opens a phased plan under .cursor/plans/*.plan.md and chooses Build plan (or asks the agent to implement it), treat that file as the source of truth for scope and sequencing, not as automatic approval to rewrite unrelated code.

  1. Read the plan and the current code — confirm which phase is in scope (stop at explicit phase boundaries unless the user expands scope).
  2. Spec by tests first when the plan calls for behavior — add or extend Vitest (dashboard-server / dashboard) so the new semantics are reproducible without clicking the UI.
  3. Implement minimally — follow monorepo resolver/UI rules; regenerate GraphQL types only when schema or .graphql files change.
  4. Verify — yarn type-check, yarn lint, and targeted vitest for touched apps; browser check when the change is UI-visible.
  5. Document — if user-visible behavior or workflow expectations change, update the relevant wiki page or CLAUDE.md note in the same effort.

If the plan’s index or cross-links in wiki/ change materially, run yarn wiki:lint from the repo root.
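The verification rules above can be approximated as a function of the changed files. This is a rough sketch under stated assumptions: planChecks is a hypothetical helper, apps are assumed to live under apps/<name>/, any change under wiki/ is treated as material, only .graphql files (not other schema sources) trigger type regeneration, and UI visibility is passed in by the caller rather than inferred.

```typescript
interface CheckPlan {
  commands: string[]; // repo-wide commands named in the steps above
  touchedApps: string[]; // apps that need targeted vitest runs
  regenerateGraphqlTypes: boolean;
  browserCheck: boolean;
}

function planChecks(changedFiles: string[], uiVisible: boolean): CheckPlan {
  // Collect app names from paths of the form apps/<name>/...
  const apps = new Set<string>();
  for (const file of changedFiles) {
    const app = /^apps\/([^/]+)\//.exec(file)?.[1];
    if (app !== undefined) apps.add(app);
  }

  const commands = ["yarn type-check", "yarn lint"];
  // Simplification: any wiki/ change triggers the wiki linter.
  if (changedFiles.some((f) => f.startsWith("wiki/"))) {
    commands.push("yarn wiki:lint");
  }

  return {
    commands,
    touchedApps: Array.from(apps).sort(),
    regenerateGraphqlTypes: changedFiles.some((f) => f.endsWith(".graphql")),
    browserCheck: uiVisible,
  };
}
```

The function deliberately returns flags for GraphQL regeneration and the browser check instead of commands, since the exact commands for those steps are not named here.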

Cross-references

Vision & mandate

The four commitments that define how CAIRE builds — the north star for every architectural decision.

PRD-to-PR pipeline

Nine stages, each with named artefacts. From a one-page brief to a merged pull request with a screenshot.

Agent roles & model routing

Architect, Test-writer, Editor, Verifier, Reviewer — and which model each one runs.

Model & vendor agnosticism

Models change weekly; the architecture doesn't. Why CAIRE routes through adapters instead of vendor SDKs.

Spec as contract

The spec is the coordination object. Tests are generated from the spec — not the other way around.

Verification & evidence

The dossier is the proof. A screenshot bundle attached to every PR — the Playwright Agents 1.56 pattern.

Reviewer feedback loop

Treat Codex / CodeRabbit comments as failed tests. The agent re-enters the loop until reviews are clean.

Scale or kill

Auto-promote what works; auto-kill what regresses. Hands-free ramp via GrowthBook + business signals.

Throughput & business signals

Features per second per dollar. The north-star metric every routing decision is graded against.

Darwin as orchestrator

Telegram → Darwin → PR-with-screenshot. The orchestrator that replaces the human CTO inside the loop.

Darwin component map

The architectural inventory — every named piece of the agent execution stack and how they connect.