CAIRE's engineering execution model is an agentic PRD-to-PR pipeline. A human writes a product brief; specialised AI agents handle spec, tests, implementation, review, and verification; the human approves a screenshot. Compute is the bottleneck, not headcount.
From a sentence to shipped software. The human appears in two places only: writing the PRD and approving the dossier screenshot.
The agentic workflow is not "AI assistance for engineers". It's the execution model. Four commitments make the workflow distinctive.
> "Caire must scale output 1000× without scaling humans. We do not build teams. We build execution systems." – CTO, AI Systems & Agent Workforce mandate, 2026-04-29

Product owners write PRDs. The pipeline takes the PRD from there (spec, tests, code, review, dossier, merge) without a human in the middle of any stage.
Every model call goes through an adapter. Routing is config, not code. When a better or cheaper model ships, rotation is a quarterly review, not a refactor.
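A minimal sketch of what config-driven routing could look like. The role names match the pipeline stages below, but the provider and model identifiers are illustrative placeholders, not the actual routing table.

```ts
// Hypothetical routing config: which model serves each pipeline role.
// Swapping a provider or model is an edit here, not a code change.
type Role = "architect" | "test-writer" | "editor" | "verifier" | "reviewer-feedback";

interface RouteConfig {
  provider: string; // illustrative provider key
  model: string;    // model identifier as the provider expects it
  maxOutputTokens: number;
}

const routing: Record<Role, RouteConfig> = {
  architect:           { provider: "reasoning-provider", model: "reasoning-large", maxOutputTokens: 8192 },
  "test-writer":       { provider: "mid-provider",       model: "mid-small",       maxOutputTokens: 4096 },
  editor:              { provider: "mid-provider",       model: "mid-small",       maxOutputTokens: 4096 },
  verifier:            { provider: "mid-provider",       model: "mid-small",       maxOutputTokens: 2048 },
  "reviewer-feedback": { provider: "mid-provider",       model: "mid-small",       maxOutputTokens: 2048 },
};

// The adapter is the only place that reads this table; callers only name the role.
function resolveRoute(role: Role): RouteConfig {
  return routing[role];
}
```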
Post-merge metrics ramp a feature 1% → 100%. Regression flips the flag back. The decision is mechanical; humans don't decide "ok, ramp this".
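A sketch of the mechanical ramp decision, assuming a simple error-rate comparison against baseline; the field names, ramp steps, and threshold are illustrative, not the production flag logic.

```ts
// Hypothetical post-merge ramp: advance rollout while metrics hold,
// flip the flag back the moment a regression shows up.
interface FlagMetrics {
  errorRate: number;         // failure rate with the flag on
  baselineErrorRate: number; // same metric with the flag off
}

const RAMP_STEPS = [1, 5, 25, 50, 100]; // percent of traffic, illustrative

function nextRollout(currentPercent: number, m: FlagMetrics): number {
  const regressed = m.errorRate > m.baselineErrorRate * 1.5; // illustrative threshold
  if (regressed) return 0; // kill switch: back to 0%, no human in the loop
  const idx = RAMP_STEPS.indexOf(currentPercent);
  return idx >= 0 && idx < RAMP_STEPS.length - 1 ? RAMP_STEPS[idx + 1] : 100;
}
```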
Cash balance, revenue, and burn are system inputs. The orchestrator refuses to spawn an expensive run if today's budget is exhausted. No human "tighten the belt" call.
Every architectural decision in the pipeline is graded by one question: does it make us ship more features per second, per dollar (and per token)? The metric is deliberately tiny in absolute value. What matters is the trajectory.
Wall-clock from PRD intake to merged PR. Squeezing this means parallelising stages, caching specs, removing human round-trips. Every shipped feature lands a row in .compound-state/agent-service.db with its elapsed time.
Total model spend across all eight stages, per merged PR. Cheaper providers, smaller models for cheap routing, batch APIs, prompt caching: every lever points back here. The orchestrator refuses runs whose projected cost would exceed today's budget.
Total input + output tokens across the pipeline, per merged PR. The cleaner the spec and the tighter the dossier contract, the fewer tokens the editor burns iterating. Tokens are a leading indicator of cost.
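A sketch of how the headline metric could be derived from the throughput log. The column names on the `.compound-state/agent-service.db` rows are assumptions; only the metric's shape (features per second, per dollar) comes from the text above.

```ts
// Hypothetical row shape for one merged feature in .compound-state/agent-service.db.
interface ThroughputRow {
  feature: string;
  elapsedSeconds: number; // wall-clock from PRD intake to merged PR
  costUsd: number;        // model spend across all stages
  tokens: number;         // input + output tokens across the pipeline
}

// Features per second per dollar over a window of merged PRs.
// The absolute number is tiny; the weekly trend is what the ratchet watches.
function featuresPerSecondPerDollar(rows: ThroughputRow[]): number {
  const seconds = rows.reduce((sum, r) => sum + r.elapsedSeconds, 0);
  const dollars = rows.reduce((sum, r) => sum + r.costUsd, 0);
  return rows.length / (seconds * dollars);
}
```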
An optimization mathematician agent reads the throughput log every week and proposes routing changes: different model per role, different batch size, different cache strategy. The CPO/CTO agent ratifies. The ratchet only moves one way.
This page itself will move with the metric. New routing wins, new agent prompts, new dossier shapes: every improvement that nudges features-per-second-per-dollar lands here as a refresh.
Each stage produces a checked artefact. The next stage refuses to start without it. The pipeline is one continuous flow, not a checklist: every stage hands off a typed result.
The two re-entry edges (P1 found → re-enter editor) are the only loops. Everything else is a one-way contract from PRD to merged commit.
| # | Stage | Output | Driven by |
|---|-------|--------|-----------|
| 0 | Intake normalize – convert chat, brief, or Cursor plan into a portable PRD | Canonical PRD path + feature slug | Orchestrator |
| 1 | PRD – a human writes the brief: change, success metric, non-goals | `wiki/plans/<feature>-YYYY-MM-DD.md` | Human (only required step) |
| 2 | Spec – architect agent emits Gherkin acceptance scenarios | `wiki/specs/<feature>.md` | Architect (reasoning model) |
| 3 | Test stubs – test-writer compiles each scenario into a failing test | Vitest + Playwright tests that fail on first run | Test-writer (mid model) |
| 4 | Implementation – editor iterates on the diff until every test passes | Code in an isolated worktree; all target tests green | Editor (mid model) |
| 5 | Self-review – three reviewer subagents inspect the diff in parallel | P1 / P2 / P3 findings list, or "clean" | resolver-reviewer · dashboard-reviewer · perf-reviewer |
| 6 | Verification & dossier – type-check, lint, tests, Playwright trace + screenshot bundle | `docs/dossiers/<feature>/{trace.zip, screenshot.png, summary.json}` | Verifier (mid model) |
| 7 | Reviewer loop – external review-bot comments re-enter the editor on P1 / P2 | Every finding addressed or argued down with a justification | Reviewer-feedback (mid model) |
| 8 | Merge + evidence – auto-merge queued; dossier screenshot delivered for human sign-off | Merged commit on `main` + screenshot to the human channel | Orchestrator + GitHub Merge Queue |
The contract is sharp: no dossier, no merge. Every stage's output is a typed, persisted artefact that the next stage reads, and that a human, an audit, or a future agent can replay.
The agentic workflow doesn't make humans disappear; it makes them strategic. Each role has one or two narrow places to step in. Everything else is software.
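One way the typed hand-off could be expressed in code. The field names are illustrative; only the artefact paths and the no-dossier-no-merge rule come from the table and contract above.

```ts
// Hypothetical typed artefacts: each stage refuses to start unless the
// previous stage's artefact exists and parses.
interface SpecArtefact   { path: string; scenarios: string[] }                       // wiki/specs/<feature>.md
interface TestArtefact   { files: string[]; allFailingOnFirstRun: boolean }          // Vitest + Playwright stubs
interface DiffArtefact   { worktree: string; targetTestsGreen: boolean }             // isolated worktree
interface ReviewArtefact { findings: { severity: "P1" | "P2" | "P3"; note: string }[] }
interface Dossier {
  trace: string;      // docs/dossiers/<feature>/trace.zip
  screenshot: string; // docs/dossiers/<feature>/screenshot.png
  summary: string;    // docs/dossiers/<feature>/summary.json
}

// The merge stage enforces the contract: no dossier, no merge.
function canQueueMerge(dossier: Dossier | null): boolean {
  return dossier !== null;
}
```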
The narrow places:

- A PRD in `wiki/plans/` with a success metric and explicit non-goals.
- A check of `summary.json` for skipped scenarios or empty Playwright traces.
- `docs/dossiers/<feature>/` on a merged PR: full Playwright trace, screenshot, console log.

The pipeline targets steady-state product engineering: the work that, in a traditional team, fills standups and sprints. Bigger architectural moves still get a human-led plan.
A new banner, a new page, a form variation. PRD names the acceptance scenarios; pipeline writes Vitest + Playwright tests; Editor implements; Verifier captures the screenshot dossier.
PRD frames the bug as a failing scenario. Test-writer compiles it; Editor fixes; resolver-reviewer + perf-reviewer catch N+1 regressions before the PR opens.
Throughput row, KPI tile, dashboard chart. Spec names the data source; tests assert the shape; the dossier shows the metric rendering with realistic seed data.
Quarterly: the optimization mathematician proposes a routing change based on cost, pass rate, latency. CPO/CTO agent ratifies. One config edit; the adapter handles the rest.
Audit pass like the agentic-workflow audit itself: identify drift, fix the doc, run `yarn wiki:lint`, ship. Wiki-only PRs use the same eight stages with the verifier in light mode.
New CSV adapter, new gate provider, new external feed. PRD names the contract; tests cover the boundary; dossier proves the integration with real fixtures and a recorded trace.
Autonomous loops without guardrails are how AI projects burn budgets and ship regressions. Every stage of the pipeline is fenced. The Editor agent sits at the centre, surrounded by mechanisms that can slow it down, redirect it, or stop it entirely.
Eight independent guardrails. No single one prevents bugs alone; together they make autonomous shipping safe enough that the human's only required action is reading a screenshot.
The Editor cannot ship behaviour the spec didn't name. Drift between the PRD and the spec is itself a P2 finding for the verifier, and the spec is short enough to fit in every agent's context, so drift is always provable.
`MAX_ITERATIONS = 8` on the editor's inner loop, plus a max of 3 re-entry cycles from review or Codex feedback. Hit the ceiling and the run surfaces failure to a human; it never grinds.
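A sketch of the shape of that guard. The two ceilings are the ones named above; the function name and return values are illustrative.

```ts
// Hypothetical iteration guard: hard ceilings, then surface to a human.
const MAX_ITERATIONS = 8;     // editor's inner loop
const MAX_REENTRY_CYCLES = 3; // re-entries from review or Codex feedback

function guardIteration(innerLoop: number, reentries: number): "continue" | "escalate" {
  if (innerLoop >= MAX_ITERATIONS || reentries >= MAX_REENTRY_CYCLES) {
    return "escalate"; // the run stops and reports failure instead of grinding
  }
  return "continue";
}
```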
The optional `PIPELINE_BUDGET_ENFORCEMENT` flag (off in pilot, on once revenue is real) refuses further model calls when the per-run cost would exceed the cap. In pilot the human is the only PRD producer, so cost is implicitly bounded.
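A sketch of the budget gate. `PIPELINE_BUDGET_ENFORCEMENT` is the flag named above; the input fields and function name are assumptions.

```ts
// Hypothetical budget gate: when enforcement is on, a run whose projected
// cost exceeds what remains of today's budget never spawns.
interface BudgetInputs {
  enforcementOn: boolean;          // PIPELINE_BUDGET_ENFORCEMENT
  projectedRunCostUsd: number;     // estimated model spend for this run
  remainingDailyBudgetUsd: number; // derived from cash balance, revenue, burn
}

function mayStartRun(b: BudgetInputs): boolean {
  if (!b.enforcementOn) return true; // pilot mode: the human PRD author bounds spend
  return b.projectedRunCostUsd <= b.remainingDailyBudgetUsd;
}
```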
A green CI run is necessary but not sufficient. The dossier (Playwright trace, final screenshot, console log, machine-readable summary) proves the feature actually rendered. Reviewers can replay the trace; the human sees the screenshot.
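A possible shape for the machine-readable summary and the verifier-side sanity check on it. The field names are assumptions, not the actual `summary.json` schema.

```ts
// Hypothetical summary.json shape that reviewers and auditors can parse.
interface DossierSummary {
  feature: string;
  scenarios: { name: string; status: "passed" | "skipped" }[];
  traceBytes: number;      // 0 would mean an empty Playwright trace
  screenshot: string;      // relative path inside docs/dossiers/<feature>/
  consoleErrors: string[]; // anything non-empty warrants a look
}

// Skipped scenarios or an empty trace fail the dossier before it reaches a human.
function dossierLooksHonest(s: DossierSummary): boolean {
  return s.traceBytes > 0 && s.scenarios.every((sc) => sc.status === "passed");
}
```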
Same pipeline, three ways to enter it. Pick the surface that matches the moment: a PRD in version control, a form on the Dashboard, a message on Telegram. Every surface produces the same dossier and the same merge decision.
Write `wiki/plans/<feature>-YYYY-MM-DD.md`, create an isolated worktree with `./scripts/git/worktree-add.sh`, and the pipeline runs against that branch. Reviewer subagents lint the diff, the verifier captures the dossier, the merge queue lands the PR. Best for engineers shipping in version control.
Paste a PRD path, click start, watch the pipeline progress at `localhost:3010`. The dossier viewer surfaces the screenshot, console log, and machine-readable summary inline. Approve / Reject is a button; no GitHub round-trip required. Best for product owners who want a UI, not a CLI.
Post a PRD link to interface-agent. The same backend runs the pipeline; the dossier screenshot posts back to the originating thread. Reply approve and it merges. The lightest possible surface: a notification and an image. Best for the founder reading on a phone between meetings.
The pipeline is built on three open patterns and one discipline. Nothing here is bespoke for the sake of bespoke.
One expensive reasoning pass produces the spec; many cheap edit passes implement against it. ~1/14th the cost of running every call on the reasoning model: the cost driver is the editor, not the architect.
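Illustrative arithmetic only: the per-token prices and token volumes below are assumptions chosen to show why the editor dominates cost, not the measured numbers behind the ~1/14th figure.

```ts
// Hypothetical prices and token volumes for one feature run.
const reasoningCostPerMTok = 10.0; // USD per million tokens, illustrative
const midCostPerMTok = 1.0;        // USD per million tokens, illustrative

const architectTokens = 0.05; // millions of tokens for the single spec pass
const editorTokens = 1.0;     // millions of tokens across the many edit passes

const hybridCost = architectTokens * reasoningCostPerMTok + editorTokens * midCostPerMTok;
const allReasoningCost = (architectTokens + editorTokens) * reasoningCostPerMTok;

// With these made-up inputs the hybrid run is ~1.5 USD vs ~10.5 USD, roughly 1/7th;
// the document's ~1/14th reflects its own measured token mix and prices.
console.log({ hybridCost, allReasoningCost, ratio: hybridCost / allReasoningCost });
```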
Plain language is too imprecise to coordinate multiple agents. Acceptance scenarios in Gherkin are short enough to fit every agent's context, precise enough to compile to failing tests, and greppable.
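A sketch of the compile step with a made-up scenario: the Gherkin text (in the comment) and the selector are illustrative, and the test-writer's actual templates will differ, but the pattern is a scenario that becomes a failing Playwright test.

```ts
// Hypothetical Gherkin scenario from a spec:
//   Scenario: banner shows on the dashboard
//     Given a signed-in coordinator
//     When they open the dashboard
//     Then the onboarding banner is visible
import { test, expect } from "@playwright/test";

test("banner shows on the dashboard", async ({ page }) => {
  // Assumes seeded auth state; the route and test id are illustrative.
  await page.goto("/dashboard");
  await expect(page.getByTestId("onboarding-banner")).toBeVisible();
});
```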
A green CI run is necessary but not sufficient. The dossier (trace, screenshot, console log, machine-readable summary) proves the feature actually rendered, in a real browser, in the state the spec named.
External review bots (Codex, CodeRabbit) post comments after every push. The pipeline polls them, treats P1/P2 as failed tests, and re-enters the editor automatically. The discipline is non-optional.
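A sketch of how "P1/P2 as failed tests" could look in code; the comment shape and severity parsing are assumptions, the re-entry rule is the one described above.

```ts
// Hypothetical shape of an external review comment collected after a push.
interface ReviewComment {
  bot: "codex" | "coderabbit";
  severity: "P1" | "P2" | "P3";
  body: string;
}

// P1/P2 findings are treated exactly like failing tests: the run re-enters
// the editor instead of waiting for a human to triage.
function mustReenterEditor(comments: ReviewComment[]): boolean {
  return comments.some((c) => c.severity === "P1" || c.severity === "P2");
}
```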
How CAIRE composes AI agents into an operating system for home-care delivery.
VRPTW, NP-hardness, and the hybrid human-AI optimization model that powers scheduling.
How CAIRE handles AI regulation, audit trails, and responsible deployment in healthcare.
Twelve guides cover every stage in depth: vision and mandate, agent roles, the dossier pattern, the reviewer-feedback loop, and the orchestrator that ties it together.
Read the long form: AI-OS for Business.