The numbers
platform teams ran.
A peer-reviewable benchmark report. Every claim traces to a method line. Every dataset is downloadable.
Cohort of 40 enterprise tenants, Q1 2026, 90-day rolling. Methodology and raw datasets published. Reproducer harness on GitHub. We expect engineers to re-run the harness against their own infrastructure and tell us where our numbers don't match.
From API request
to terminal-ready.
The clock starts when the workspace API receives the create request. It stops when the first interactive shell prompt is ready. No client-side bias, no warmed-pool cherry-picking.
We measured 1,000 cold starts per vendor across three regions (us-east-1, eu-west-3, ap-southeast-1) and four tenant sizes (10/50/200/1,000 dev seats). The harness drains warm pools before every trial — so what you're seeing is the genuine first-byte-cold path, not a hot snapshot.
The harness is open. The dataset is signed. The methodology is versioned in-repo and the trials carry the commit SHA they ran against. If your numbers diverge, the diagnostic harness will tell you which percentile flipped and on which tenant size.
- trials / vendor: 1,000
- regions: 3
- tenant sizes: 10 / 50 / 200 / 1k
- clock: API → shell prompt
- warm-pool drain: enforced
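
If you re-run the harness, you'll want the same cut of the data we describe above: percentiles per region and tenant size. Here is a minimal sketch of that summary step, not the harness's actual code. The tuple layout `(region, seats, seconds)` and the function name `summarize` are assumptions for illustration; `seconds` is the API-to-shell-prompt clock.

```python
from collections import defaultdict
from statistics import quantiles


def summarize(trials):
    """Group cold-start trials by (region, seats) and report P50 / P95.

    `trials` is an iterable of (region, seats, seconds) tuples. This layout
    is hypothetical; adapt it to whatever the harness actually emits.
    Each group needs at least two trials.
    """
    groups = defaultdict(list)
    for region, seats, seconds in trials:
        groups[(region, seats)].append(seconds)

    summary = {}
    for key, values in groups.items():
        # quantiles(n=100) returns 99 cut points: index 49 is P50, index 94 is P95.
        cuts = quantiles(values, n=100, method="inclusive")
        summary[key] = {"n": len(values), "p50": cuts[49], "p95": cuts[94]}
    return summary
```

Comparing two such summaries key by key shows which percentile moved and on which tenant size.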
Side by side.
Numbers only.
Seven metrics, chosen because platform teams asked about them in procurement reviews. exAI numbers appear in cobalt; competitor numbers come from public benchmarks, vendor-published SLOs, and the cohort harness re-run against each vendor's offering.
| Metric | exAI | GitHub Codespaces | Gitpod | Coder | Self-host fork |
|---|---|---|---|---|---|
| Composer P50 latency | 1.2s | 3.4s | n/a | n/a | 4.1s |
| Composer apply-rate · PRs merged with no edits | 78% | 41% | — | — | 29% |
| Orchestrator wall-clock · 30h migration | 29h 41m | n/a | n/a | n/a | 62h+ |
| Audit log throughput · events/sec | 84,000 | 8,200 | 5,400 | 12,000 | 1,800 |
| RTO · regional failover | 47s | 12m | 9m | 6m | manual |
| Mean prebuild warm time | 0.8s | 21s | 8s | 6s | 33s |
| PII scrub recall · test corpus | 0.998 | 0.91 | n/a | n/a | 0.84 |
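
To read the PII scrub recall row: assuming the standard definition of recall (PII items scrubbed divided by all PII items present in the test corpus), the calculation can be sketched as below. The counts in the usage note are illustrative only, not figures from the dataset.

```python
def recall(true_positives, false_negatives):
    """Standard recall: the share of real PII items that the scrubber caught.

    true_positives  = PII items correctly scrubbed
    false_negatives = PII items that slipped through
    """
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("corpus contains no PII items")
    return true_positives / total
```

For example, `recall(998, 2)` gives 0.998: on a hypothetical corpus of 1,000 PII items, two got through.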
Throughput
inverts at the M tier.
Throughput peaks at the 50–500 engineer tier and decays as orgs grow. The plateau is review bandwidth, not generation. Read the heuristic narrative beside the chart.
Small orgs (S) move fast because everyone reviews. The M tier optimizes — Composer's plan-diff-apply flow lands more merge-ready PRs and review queues stay under a working day.
At L and beyond, throughput per engineer inverts — review queues dilate, security and architectural reviewers become bottlenecks, and the marginal AI-PR competes with mandatory human review on every code path. The tooling didn't slow down. The org chart did.
XXL throughput (18 PRs/eng/mo) still beats every comparable public benchmark for AI-assisted development at scale — but the gap to M (31) is the most honest argument we have for keeping your platform-engineering org tight.
- cohort N: 40 tenants
- window: Q1 2026 · 90d
- definition: Composer-authored · merged
- silent merges: 0
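
The M-to-XXL gap is simple arithmetic on the two figures quoted above, reading both as PRs per engineer per month. A minimal sketch, with `relative_drop` as a hypothetical helper name:

```python
def relative_drop(peak, other):
    """Fractional drop from the peak tier's throughput to another tier's."""
    return (peak - other) / peak


m_tier = 31    # PRs/eng/mo at the M tier, as quoted above
xxl_tier = 18  # PRs/eng/mo at the XXL tier, as quoted above

print(f"{relative_drop(m_tier, xxl_tier):.1%}")  # prints 41.9%
```

An XXL engineer lands roughly 58% of what an M-tier engineer does.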
Three seat counts.
One accounting truth.
Total cost of ownership over 36 months — workspace compute, AI tokens, support, and ops effort priced in. Inputs and assumptions are versioned alongside the dataset so your finance team can rebuild the model from primary numbers.
Benchmarked against an equivalent stack of GitHub Codespaces + GitHub Copilot Enterprise + Sentry + Datadog, sized to match each scenario's seat count. Compute prices use list rates from each vendor as of Q1 2026; AI token costs use the cohort-median spend per merged PR. Ops effort is priced at $180k/yr fully-loaded. Full inputs at /benchmarks/tco.csv.
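
For a finance team rebuilding the model, the four cost lines above can be sketched as follows. This is a simplified illustration, not the published model: every field name is hypothetical, the column layout of /benchmarks/tco.csv is not assumed, and treating the $180k/yr ops rate as per full-time equivalent is our reading of "fully-loaded".

```python
from dataclasses import dataclass

MONTHS = 36                      # TCO horizon stated in the report
OPS_COST_PER_FTE_YEAR = 180_000  # ops rate from the report; per-FTE is an assumption


@dataclass
class Scenario:
    # All fields are hypothetical inputs; supply your own numbers.
    seats: int
    compute_per_seat_month: float     # workspace compute, list rate
    merged_prs_per_seat_month: float  # merged PR volume
    token_cost_per_merged_pr: float   # AI token spend per merged PR
    support_per_year: float           # support contract
    ops_fte: float                    # ops effort in full-time equivalents


def tco(s: Scenario) -> dict:
    """Return the 36-month cost per line item plus the total."""
    years = MONTHS / 12
    lines = {
        "compute": s.seats * s.compute_per_seat_month * MONTHS,
        "tokens": s.seats * s.merged_prs_per_seat_month
                  * s.token_cost_per_merged_pr * MONTHS,
        "support": s.support_per_year * years,
        "ops": s.ops_fte * OPS_COST_PER_FTE_YEAR * years,
    }
    lines["total"] = sum(lines.values())
    return lines
```

Run it once per seat-count scenario and once for the comparison stack, then compare the `total` values.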
Run them yourself.
Then talk to us.
The harness clones, the dataset signs, the methodology versions. We don't ship benchmark numbers we wouldn't hand a skeptical platform engineer a stopwatch for. If yours diverge, file an issue.