§ 01 / 06

Benchmarks · independently runnable
Methodology · published
Re-run them on your own cluster.

cohort live · n=40 · 90-day rolling

exAI Agentic OS · benchmarks · raw datasets · reproducer harness

The numbers
platform teams ran.

A peer-reviewable benchmark report. Every claim traces to a method line. Every dataset is downloadable.

Cohort of 40 enterprise tenants, Q1 2026, 90-day rolling. Methodology and raw datasets published. Reproducer harness on GitHub. We expect engineers to re-run the harness against their own infrastructure and tell us where our numbers don't match.

Download dataset (1.8 MB) · Run the harness

cohort summary · q1 2026

v2026.04

Tenants in cohort: 40 (enterprise · self-reported)

Days · rolling: 90 (Jan 7 → Apr 7 · 2026)

Cold-start P50: 612 ms (1,000-trial mean)

AI-PRs / eng / mo (cohort median)

sha-256 · 4f82a1c… · verify →

Fig. 01 · cohort snapshot · q1 2026 · n=40 · 90d

§ 02 / 06

Workspace cold-start · P50/P90/P99

From API request
to terminal-ready.

The clock starts when the workspace API receives the create request. It stops when the first interactive shell prompt is ready. No client-side bias, no warmed-pool cherry-picking.

We measured 1,000 cold starts per vendor, spread across three regions (us-east-1, eu-west-3, ap-southeast-1) and four tenant sizes (10/50/200/1,000 dev seats). The harness drains warm pools before every trial, so each result reflects the genuine cold path, not a hot snapshot.

The harness is open. The dataset is signed. The methodology is versioned in-repo and the trials carry the commit SHA they ran against. If your numbers diverge, the diagnostic harness will tell you which percentile flipped and on which tenant size.

· trials / vendor: 1,000
· regions: 3
· tenant sizes: 10 / 50 / 200 / 1k
· clock: API → shell prompt
· warm-pool drain: enforced
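
The clock and the percentile summary described above can be sketched as follows. This is a minimal illustration, not the harness itself: the function names, the `(received_at, prompt_ready_at)` timestamp pairs, and the nearest-rank percentile method are all assumptions, since the page does not say which percentile estimator the harness uses.

```python
import math
from typing import Iterable, Sequence, Tuple


def latencies(trials: Iterable[Tuple[float, float]]) -> list:
    """Turn (received_at, prompt_ready_at) timestamp pairs into latencies.

    Hypothetical input shape: both timestamps are in seconds, taken when the
    API receives the create request and when the first shell prompt is ready.
    """
    return [ready - received for received, ready in trials]


def nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> dict:
    """Report the three percentiles the page publishes."""
    return {f"p{p}": nearest_rank(samples, p) for p in (50, 90, 99)}
```

With 1,000 trials per vendor, nearest-rank P99 is the 990th-smallest latency, so the ten slowest trials decide everything above it.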

workspace.cold-start · n=5,000 trials

Vendor	P50	P90	P99
exAI	612ms	1.2s	2.6s
GitHub Codespaces	38s	52s	71s
Gitpod	14s	22s	39s
Coder	9.2s	18s	31s
Self-host fork	45s	62s	92s

methodology · /benchmarks/cold-start.md · sha-256 · 4f82a1c · signed

Fig. 02 · cold-start distribution · 5 vendors · P50 · P90 · P99
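
A downloaded dataset can be checked against the published digest before you trust any number in it. The sketch below only compares a SHA-256 digest; it does not verify a signature, and the page does not describe the signing scheme. The page shows only a digest prefix, so the check is a prefix match. The function names and the file path in the usage note are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_prefix(path: Path, published_prefix: str) -> bool:
    """True when the file's digest starts with the published prefix."""
    return sha256_hex(path).startswith(published_prefix.lower())
```

Usage would look like `matches_published_prefix(Path("dataset.bin"), "4f82a1c")`. A seven-character prefix can rule a file out but cannot prove it is the right one, so compare the full digest wherever the methodology file lists it.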

§ 03 / 06

Comparison matrix · seven metrics

Side by side.
Numbers only.

Seven metrics, chosen because platform teams asked about them in procurement reviews. exAI numbers are in cobalt; competitor numbers come from public benchmarks, vendor-published SLOs, and the cohort harness re-run against each vendor's offering.

Metric	exAI	GitHub Codespaces	Gitpod	Coder	Self-host fork
Composer P50 latency	1.2s	3.4s	n/a	n/a	4.1s
Composer apply-rate · PRs merged with no edits	78%	41%	—	—	29%
Orchestrator wall-clock · 30h migration	29h 41m	n/a	n/a	n/a	62h+
Audit log throughput · events/sec	84,000	8,200	5,400	12,000	1,800
RTO · regional failover	47s	12m	9m	6m	manual
Mean prebuild warm time	0.8s	21s	8s	6s	33s
PII scrub recall · test corpus	0.998	0.91	n/a	n/a	0.84

Source · exAI Customer Advisory Board · q1 2026 · n=40 · 90d

Download matrix · CSV (28 KB)
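
The matrix mixes units (`612ms`, `47s`, `12m`, `29h 41m`) with non-numeric cells (`n/a`, `manual`, `62h+`), so any comparison has to normalize first. Below is one way to do that, assuming the CSV cells match the strings shown in the table. The helper name is hypothetical, and treating open-ended cells such as `62h+` as non-comparable is a design choice, not something the methodology specifies.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed first so that "612ms" is not read as minutes.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*(ms|h|m|s)")


def to_seconds(cell: str) -> Optional[float]:
    """Convert a duration cell to seconds, or return None if the cell is
    not a plain duration (for example 'n/a', 'manual', or '62h+')."""
    text = cell.strip()
    tokens = _TOKEN.findall(text)
    leftover = _TOKEN.sub("", text).strip()
    if not tokens or leftover:
        return None
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in tokens)
```

For the RTO row, `to_seconds("12m") / to_seconds("47s")` comes to roughly 15, which is the ratio between the Codespaces and exAI failover figures in the table.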

§ 04 / 06

AI-assisted PRs merged / engineer / month

Throughput
inverts at the M tier.

Throughput peaks at the 50–500 engineer tier and declines as orgs grow. The constraint is review bandwidth, not generation.

Small orgs (S) move fast because everyone reviews. The M tier optimizes — Composer's plan-diff-apply flow lands more merge-ready PRs and review queues stay under a working day.

At L and beyond, throughput per engineer inverts: review queues lengthen, security and architectural reviewers become bottlenecks, and the marginal AI-PR competes with mandatory human review on every code path. The tooling didn't slow down. The org chart did.

XXL throughput (18 PRs/eng/mo) still beats every comparable public benchmark for AI-assisted development at scale — but the gap to M (31) is the most honest argument we have for keeping your platform-engineering org tight.

· cohort N: 40 tenants
· window: q1 2026 · 90d
· definition: Composer-authored · merged
· silent merges: 0

throughput · ai-prs / eng / month · by tier

Tier	Org size	AI-PRs / eng / mo	n
S	<50 eng	26	9
M	50–500 eng	31	14
L	500–2K eng	24	9
XL	2K–10K eng	20	5
XXL	10K+ eng	18	3

methodology · /benchmarks/throughput.md · raw csv · 412 KB

Cohort median (PRs / eng / mo)

XXL tier median: 18 PRs / eng / mo

Silent merges: 0

Reviewer ack-rate (%)
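
The tier figures support a quick sanity check on the M-to-XXL gap. The sketch below uses only the per-tier values and tenant counts from the chart. The tenant-weighted average it computes is an illustration and is not the cohort median, which cannot be derived from tier-level figures alone.

```python
# (tier, AI-PRs per engineer per month, tenants in tier), copied from the chart
TIERS = [
    ("S", 26, 9),
    ("M", 31, 14),
    ("L", 24, 9),
    ("XL", 20, 5),
    ("XXL", 18, 3),
]


def tenant_weighted_mean(tiers) -> float:
    """Average of the tier figures, weighted by tenants per tier."""
    total_tenants = sum(n for _, _, n in tiers)
    return sum(value * n for _, value, n in tiers) / total_tenants


def relative_gap(peak: float, other: float) -> float:
    """Fraction by which `other` falls short of `peak`."""
    return (peak - other) / peak


peak_tier = max(TIERS, key=lambda row: row[1])[0]       # tier with the highest figure
xxl_gap = relative_gap(31, 18)                          # M versus XXL
```

The tenant counts sum to 40, matching the cohort size, and the XXL figure sits about 42% below the M peak.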

§ 05 / 06

3-year TCO · three scenarios

Three seat counts.
One accounting truth.

Total cost of ownership over 36 months — workspace compute, AI tokens, support, and ops effort priced in. Inputs and assumptions are versioned alongside the dataset so your finance team can rebuild the model from primary numbers.

Scenario 01 · 3-year · usd

50 seats · seed-stage platform

$184K

· Workspace compute: $71K

· AI tokens · routed: $58K

· Support · 9×5: $22K

· Ops effort · 0.4 FTE: $33K

Scenario 02 · 3-year · usd

500 seats · platform org

$1.24M

· Workspace compute: $481K

· AI tokens · routed: $402K

· Support · 24×7: $148K

· Ops effort · 1.8 FTE: $209K

Scenario 03 · 3-year · usd

5,000 seats · global enterprise

$9.7M

· Workspace compute: $3.8M

· AI tokens · routed: $3.1M

· Support · TAM + 24×7: $1.2M

· Ops effort · 6.5 FTE: $1.6M

Methodology

Benchmarked against an equivalent stack of GitHub Codespaces + GitHub Copilot Enterprise + Sentry + Datadog, sized to match each scenario's seat count. Compute prices use list rates from each vendor as of Q1 2026; AI token costs use the cohort-median spend per merged PR. Ops effort is priced at $180k/yr fully loaded. Full inputs at /benchmarks/tco.csv.
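
A finance team rebuilding the model can start by summing the four line items per scenario. The sketch below uses the rounded figures shown on this page, so totals are only as precise as those roundings. The per-seat monthly cost is a derived figure for illustration and is not published in the source; the dictionary keys and function names are hypothetical.

```python
MONTHS = 36  # 3-year horizon

# seats -> line items in USD, copied from the three scenarios above
SCENARIOS = {
    50: {"compute": 71_000, "tokens": 58_000, "support": 22_000, "ops": 33_000},
    500: {"compute": 481_000, "tokens": 402_000, "support": 148_000, "ops": 209_000},
    5_000: {"compute": 3_800_000, "tokens": 3_100_000, "support": 1_200_000, "ops": 1_600_000},
}


def total_cost(items: dict) -> int:
    """Sum the four line items into a 36-month total."""
    return sum(items.values())


def cost_per_seat_month(seats: int, items: dict) -> float:
    """Spread the 36-month total across seats and months."""
    return total_cost(items) / seats / MONTHS


for seats, items in SCENARIOS.items():
    print(seats, total_cost(items), round(cost_per_seat_month(seats, items), 2))
```

Swapping in the unrounded inputs from /benchmarks/tco.csv would replace the hard-coded dictionary without changing the arithmetic.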

§ 06 / 06

Reproducer harness · open · MIT

Run them yourself.
Then talk to us.

The harness is cloneable, the dataset is signed, and the methodology is versioned. We don't ship a benchmark number unless we'd hand a skeptical platform engineer a stopwatch to check it. If yours diverge, file an issue.
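
Before filing an issue, it helps to state which percentile diverged and by how much. Here is a minimal sketch of that comparison, using the published exAI cold-start percentiles as the reference. The 10% relative tolerance and the function name are assumptions; the harness's own diagnostic may use a different rule.

```python
# Published exAI cold-start percentiles, in seconds
PUBLISHED_EXAI = {"p50": 0.612, "p90": 1.2, "p99": 2.6}


def diverging(measured: dict, published: dict, rel_tol: float = 0.10) -> dict:
    """Return {percentile: relative delta} for every percentile whose
    measured value differs from the published one by more than rel_tol.

    A positive delta means the measured value is slower than published.
    """
    flagged = {}
    for name, reference in published.items():
        delta = (measured[name] - reference) / reference
        if abs(delta) > rel_tol:
            flagged[name] = delta
    return flagged
```

An empty result means every percentile landed within tolerance. A non-empty one gives you the exact lines to paste into the issue, alongside the tenant size and commit SHA the trials ran against.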

Run the harness · Join the waitlist

Source · exAI Customer Advisory Board · Q1 2026 cohort (n=40) · 90-day rolling · Verifiable at /benchmarks/raw
