How CAIRE Ships

Humans define what. Agents execute how.

CAIRE's engineering execution model is an agentic PRD-to-PR pipeline. A human writes a product brief; specialised AI agents handle spec, tests, implementation, review, and verification; the human approves a screenshot. Compute is the bottleneck, not headcount.

flowchart LR
    Human(["👤<br/>Human<br/>writes a brief in plain English"]) -->|PRD| Pipeline(["🤖<br/>Eight specialised agents<br/>spec · tests · code · review · dossier"])
    Pipeline -->|merged PR + screenshot| Approve(["✅<br/>Human approves<br/>silence = approval after deadline"])
    Approve -->|metrics watch| ScaleKill(["📈<br/>Scale or kill<br/>auto-ramp · auto-rollback"])
    style Human fill:#dbeafe,stroke:#2563EB,stroke-width:2px
    style Pipeline fill:#ede9fe,stroke:#9333EA,stroke-width:2px
    style Approve fill:#dcfce7,stroke:#16a34a,stroke-width:2px
    style ScaleKill fill:#fef3c7,stroke:#d97706,stroke-width:2px

From a sentence to shipped software. The human appears in two places only: writing the PRD, and approving the dossier screenshot.

  • f / s / $: the north-star metric, features per second per dollar (and per token)
  • 8: stages from PRD to merged PR
  • 1: human action required, approving a screenshot
  • 0: vendor lock-in; every model call goes via an adapter

The mandate

The agentic workflow is not "AI assistance for engineers". It's the execution model. Four commitments make the workflow distinctive.

"Caire must scale output 1000× without scaling humans. We do not build teams. We build execution systems." (CTO, AI Systems & Agent Workforce mandate, 2026-04-29)
A

Humans define what and why. Agents execute how.

Product owners write PRDs. The pipeline takes the PRD from there (spec, tests, code, review, dossier, merge) without a human in the middle of any stage.

B

Tool and model agnosticism.

Every model call goes through an adapter. Routing is config, not code. When a better or cheaper model ships, rotation is a quarterly review, not a refactor.

C

If it works, scale automatically. If it doesn't, kill automatically.

Post-merge metrics ramp a feature from 1% to 100%. Regression flips the flag back. The decision is mechanical; humans don't decide "ok, ramp this".

D

Execution tied to business signals.

Cash balance, revenue, and burn are system inputs. The orchestrator refuses to spawn an expensive run if today's budget is exhausted. No human "tighten the belt" call.

The north star: features per second per dollar

Every architectural decision in the pipeline is graded by one question: does it make us ship more features per second, per dollar (and per token)? The metric is deliberately tiny in absolute value. What matters is the trajectory.

Features per second

Wall-clock from PRD intake to merged PR. Squeezing this means parallelising stages, caching specs, removing human round-trips. Every shipped feature lands a row in .compound-state/agent-service.db with its elapsed time.

Per dollar

Total model spend across all eight stages, per merged PR. Cheaper providers, smaller models for cheap routing, batch APIs, prompt caching: every lever points back here. The orchestrator refuses runs whose projected cost would exceed today's budget.

Per token

Total input + output tokens across the pipeline, per merged PR. The cleaner the spec and the tighter the dossier contract, the fewer tokens the editor burns iterating. Tokens are a leading indicator of cost.

The trajectory matters

An optimization mathematician agent reads the throughput log every week and proposes routing changes: a different model per role, a different batch size, a different cache strategy. The CPO/CTO agent ratifies. The ratchet only moves one way.

This page itself will move with the metric. New routing wins, new agent prompts, new dossier shapes: every improvement that nudges features-per-second-per-dollar lands here as a refresh.

Eight stages, PRD to PR

Each stage produces a checked artefact. The next stage refuses to start without it. The pipeline is one continuous flow, not a checklist; every stage hands off a typed result.

flowchart TD
    PRD(["📝 PRD<br/>wiki/plans/<feature>.md"]) --> Spec(["📝 Spec<br/>Architect<br/>Gherkin acceptance"])
    Spec --> Tests(["🧪 Failing tests<br/>Test-writer<br/>vitest + Playwright"])
    Tests --> Implement(["💻 Implement<br/>Editor<br/>iterate until tests pass"])
    Implement --> Review{"🔍 Self-review<br/>3 reviewer subagents"}
    Review -->|P1 found| Implement
    Review -->|clean| Verify(["📦 Verify + Dossier<br/>Verifier<br/>trace.zip · screenshot · summary.json"])
    Verify --> Push(["🚢 Push to GitHub"])
    Push --> ReviewerLoop{"🤝 External review<br/>Codex + CodeRabbit"}
    ReviewerLoop -->|P1 / P2| Implement
    ReviewerLoop -->|clean| Approve{"👤 Human approves<br/>screenshot sign-off"}
    Approve -->|yes| Merge(["🚀 Merge queue<br/>gh pr merge --auto --squash"])
    Approve -->|no| Spec
    Merge --> Done(["🎉 Shipped + throughput row"])
    classDef agent fill:#ede9fe,stroke:#7C3AED,stroke-width:2px,color:#1a1a1a
    classDef gate fill:#fef3c7,stroke:#d97706,stroke-width:2px,color:#1a1a1a
    classDef terminal fill:#dcfce7,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    classDef plumbing fill:#e0f2fe,stroke:#0284c7,stroke-width:2px,color:#1a1a1a
    class Spec,Tests,Implement,Verify agent
    class Review,ReviewerLoop,Approve gate
    class PRD,Done terminal
    class Push,Merge plumbing

The two re-entry edges (P1 found → re-enter the editor) are the only loops. Everything else is a one-way contract from PRD to merged commit.

Legend: agent stage · gate / decision · plumbing · terminal state
0 · Intake normalize
  Convert chat, brief, or Cursor plan into a portable PRD.
  Output: canonical PRD path + feature slug. Driven by: Orchestrator.

1 · PRD
  A human writes the brief: change, success metric, non-goals.
  Output: wiki/plans/<feature>-YYYY-MM-DD.md. Driven by: Human (the only required step).

2 · Spec
  Architect agent emits Gherkin acceptance scenarios.
  Output: wiki/specs/<feature>.md. Driven by: Architect (reasoning model).

3 · Test stubs
  Test-writer compiles each scenario into a failing test.
  Output: Vitest + Playwright tests that fail on first run. Driven by: Test-writer (mid model).

4 · Implementation
  Editor iterates on the diff until every test passes.
  Output: code in an isolated worktree; all target tests green. Driven by: Editor (mid model).

5 · Self-review
  Three reviewer subagents inspect the diff in parallel.
  Output: P1 / P2 / P3 findings list, or "clean". Driven by: resolver-reviewer · dashboard-reviewer · perf-reviewer.

6 · Verification & dossier
  Type-check, lint, tests, Playwright trace + screenshot bundle.
  Output: docs/dossiers/<feature>/{trace.zip, screenshot.png, summary.json}. Driven by: Verifier (mid model).

7 · Reviewer loop
  External review-bot comments re-enter the editor on P1 / P2.
  Output: every finding addressed or argued down with a justification. Driven by: Reviewer-feedback (mid model).

8 · Merge + evidence
  Auto-merge queued; dossier screenshot delivered for human sign-off.
  Output: merged commit on main + screenshot to the human channel. Driven by: Orchestrator + GitHub Merge Queue.

The contract is sharp: no dossier, no merge. Every stage's output is a typed, persisted artefact that the next stage reads, and that a human, an audit, or a future agent can replay.

Who shows up where

The agentic workflow doesn't make humans disappear; it makes them strategic. Each role has one or two narrow places to step in. Everything else is software.

📝

Product owner / PM

Defines what + why
  • Writes the PRD in wiki/plans/ with a success metric and explicit non-goals.
  • Approves (or rejects) the dossier screenshot before merge.
  • Sets the regression threshold that scale-or-kill watches post-merge.
💻

Developer

Reviews, doesn't author
  • Reads the auto-generated spec for drift from the PRD.
  • Argues down P2 reviewer comments with a justification when the agent is wrong.
  • Maintains the agent prompts and the model adapter: code about how code gets written.
🛠️

DevOps / SRE

Owns the substrate
  • Operates Darwin (the orchestrator runtime), the launchd job slots, the merge queue.
  • Watches the throughput log: cost per merged PR, features per second per token.
  • Approves model-routing rotations from the optimization mathematician's weekly proposal.
🧪

QA / Tester

Writes the rules, not the cases
  • Curates the Gherkin patterns the Test-writer agent compiles from.
  • Audits dossier summary.json for skipped scenarios or empty Playwright traces.
  • Owns the "no dossier, no merge" gate, the only thing the orchestrator cannot skip.
📈

Investor / Board

Watches the leverage
  • Reads cost-per-merged-PR trending down month over month as the routing matrix tightens.
  • Tracks features-per-second-per-token as the leverage ratio that doesn't depend on hiring.
  • Ratifies quarterly model-routing decisions; does not pick models.
🔍

Customer evaluator

Audits the trail
  • Inspects docs/dossiers/<feature>/ on a merged PR: full Playwright trace, screenshot, console log.
  • Reads the PRD frontmatter to map a shipped feature back to the original brief.
  • Verifies vendor-agnosticism by reading the routing config: no provider lock-in to inherit.

What gets shipped this way

The pipeline targets steady-state product engineering, the work that, in a traditional team, fills standups and sprints. Bigger architectural moves still get a human-led plan.

🎨

Ship a UI feature

A new banner, a new page, a form variation. PRD names the acceptance scenarios; pipeline writes vitest + Playwright tests; Editor implements; Verifier captures the screenshot dossier.

๐Ÿ›

Fix a resolver bug

PRD frames the bug as a failing scenario. Test-writer compiles it; Editor fixes; resolver-reviewer + perf-reviewer catch N+1 regressions before the PR opens.

📊

Add a metric

Throughput row, KPI tile, dashboard chart. Spec names the data source; tests assert the shape; the dossier shows the metric rendering with realistic seed data.

🔀

Rotate a model

Quarterly: the optimization mathematician proposes a routing change based on cost, pass rate, latency. CPO/CTO agent ratifies. One config edit; the adapter handles the rest.

📚

Refresh a wiki page

Audit pass like the agentic-workflow audit itself: identify drift, fix the doc, run yarn wiki:lint, ship. Wiki-only PRs use the same eight stages with the verifier in light mode.

🌐

Add an integration

New CSV adapter, new gate provider, new external feed. PRD names the contract; tests cover the boundary; dossier proves the integration with real fixtures and a recorded trace.

What keeps it from going off the rails

Autonomous loops without guardrails are how AI projects burn budgets and ship regressions. Every stage of the pipeline is fenced. The Editor agent sits at the centre, surrounded by mechanisms that can either slow it down, redirect it, or stop it entirely.

flowchart TB
    subgraph Center [" "]
        direction TB
        Editor("💻<br/>Editor agent<br/>writes code in worktree")
    end
    Spec("📝 Spec contract<br/>cannot drift past Gherkin")
    MaxIter("🔁 MAX_ITERATIONS = 8<br/>surface failure, don't grind")
    Reviewers("🔍 3 review subagents<br/>resolver · dashboard · perf")
    Codex("🤖 Codex + CodeRabbit<br/>external P1/P2 polling")
    Dossier("📦 No dossier, no merge<br/>real-browser proof required")
    Budget("💵 Cost ceiling<br/>opt-in · off in pilot")
    Worktree("🌳 Per-feature worktree<br/>never edits main directly")
    Approval("👤 Human approval gate<br/>screenshot sign-off before merge")
    Spec --> Editor
    MaxIter --> Editor
    Reviewers --> Editor
    Codex --> Editor
    Dossier --> Editor
    Budget --> Editor
    Worktree --> Editor
    Approval --> Editor
    classDef guard fill:#fff7ed,stroke:#d97706,stroke-width:2px,color:#1a1a1a
    classDef center fill:#ede9fe,stroke:#7c3aed,stroke-width:3px,color:#1a1a1a
    class Spec,MaxIter,Reviewers,Codex,Dossier,Budget,Worktree,Approval guard
    class Editor center

Eight independent guardrails. No single one prevents bugs alone; together they make autonomous shipping safe enough that the human's only required action is reading a screenshot.

Spec is the contract

The Editor cannot ship behaviour the spec didn't name. Drift between the PRD and the spec is itself a P2 finding for the verifier, and the spec is short enough to fit in every agent's context, so drift is always provable.

Iteration is bounded

MAX_ITERATIONS = 8 on the editor's inner loop, plus a maximum of 3 re-entry cycles from review or Codex feedback. Hit the ceiling and the run surfaces failure to a human; it never grinds.

Cost is bounded โ€” when you turn it on

The optional PIPELINE_BUDGET_ENFORCEMENT flag (off in pilot, on once revenue is real) refuses further model calls when the per-run cost would exceed the cap. In pilot the human is the only PRD producer, so cost is implicitly bounded.

No dossier, no merge

A green CI run is necessary but not sufficient. The dossier (Playwright trace, final screenshot, console log, machine-readable summary) proves the feature actually rendered. Reviewers can replay the trace; the human sees the screenshot.

Three surfaces into the loop

Same pipeline, three ways to enter it. Pick the surface that matches the moment: a PRD in version control, a form on the Dashboard, a message on Telegram. Every surface produces the same dossier and the same merge decision.

File-based PRD

Write wiki/plans/<feature>-YYYY-MM-DD.md, create an isolated worktree with ./scripts/git/worktree-add.sh, and the pipeline runs against that branch. Reviewer subagents lint the diff, the verifier captures the dossier, the merge queue lands the PR. Best for engineers shipping in version control.

Darwin Dashboard

Paste a PRD path, click start, watch the pipeline progress at localhost:3010. The dossier viewer surfaces the screenshot, console log, and machine-readable summary inline. Approve / Reject is a button; no GitHub round-trip required. Best for product owners who want a UI, not a CLI.

Telegram

Post a PRD link to interface-agent. The same backend runs the pipeline; the dossier screenshot posts back to the originating thread. Reply approve and it merges. The lightest possible surface: a notification and an image. Best for the founder reading on a phone between meetings.

Why this works

The pipeline is built on three open patterns and one discipline. Nothing here is bespoke for the sake of bespoke.

Aider's architect / editor split

One expensive reasoning pass produces the spec; many cheap edit passes implement against it. Roughly 1/14th the cost of running every call on the reasoning model; the cost driver is the editor, not the architect.

Spec-driven development (Gherkin)

Plain language is too imprecise to coordinate multiple agents. Acceptance scenarios in Gherkin are short enough to fit every agent's context, precise enough to compile to failing tests, and greppable.

Playwright dossier (no dossier, no merge)

A green CI run is necessary but not sufficient. The dossier (trace, screenshot, console log, machine-readable summary) proves the feature actually rendered, in a real browser, in the state the spec named.

Reviewer-feedback loop

External review bots (Codex, CodeRabbit) post comments after every push. The pipeline polls them, treats P1/P2 as failed tests, and re-enters the editor automatically. The discipline is non-optional.

Explore more

🧠

AI-OS for Business

How CAIRE composes AI agents into an operating system for home-care delivery.

🛣️

Routing Science

VRPTW, NP-hardness, and the hybrid human-AI optimization model that powers scheduling.

🛡️

AI Compliance

How CAIRE handles AI regulation, audit trails, and responsible deployment in healthcare.

Want to see the pipeline up close?

Twelve guides cover every stage in depth: vision and mandate, agent roles, the dossier pattern, the reviewer-feedback loop, and the orchestrator that ties it together.

Read the long form: AI-OS for Business