v2026.04
exAI Agentic OS
§ 01 / 06
Benchmarks · independently runnable
Methodology · published
Re-run them on your own cluster.
cohort live · n=40 · 90-day rolling
exAI Agentic OS · benchmarks · raw datasets · reproducer harness

The numbers
platform teams ran.

A peer-reviewable benchmark report. Every claim traces to a method line. Every dataset is downloadable.

Cohort of 40 enterprise tenants, Q1 2026, 90-day rolling. Methodology and raw datasets published. Reproducer harness on GitHub. We expect engineers to re-run the harness against their own infrastructure and tell us where our numbers don't match.

cohort summary · q1 2026
Tenants in cohort
40
enterprise · self-reported
Days · rolling
90
Jan 7 → Apr 7 · 2026
Cold-start P50
612 ms
1,000-trial mean
AI-PRs / eng / mo
0
cohort median
sha-256 · 4f82a1c…
Fig. 01 · cohort snapshot · q1 2026 · n=40 · 90d
§ 02 / 06
Workspace cold-start · P50/P90/P99

From API request
to terminal-ready.

The clock starts when the workspace API receives the create request. It stops when the first interactive shell prompt is ready. No client-side bias, no warmed-pool cherry-picking.

We measured 1,000 cold starts per vendor across three regions (us-east-1, eu-west-3, ap-southeast-1) and four tenant sizes (10/50/200/1,000 dev seats). The harness drains warm pools before every trial — so what you're seeing is the genuine first-byte-cold path, not a hot snapshot.
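
The drain-then-time loop described above can be sketched as follows. This is a minimal illustration, not the published harness: `drain_warm_pool`, `create_workspace`, and `wait_for_shell_prompt` are hypothetical callables standing in for whatever your platform exposes, and the sketch times from the caller's side, which only approximates the server-side clock the text defines.

```python
import time
from typing import Callable, List


def run_cold_start_trials(
    drain_warm_pool: Callable[[], None],
    create_workspace: Callable[[], str],
    wait_for_shell_prompt: Callable[[str], None],
    trials: int,
) -> List[float]:
    """Return one cold-start duration in seconds per trial.

    All three callables are hypothetical placeholders, not a real vendor API.
    """
    durations = []
    for _ in range(trials):
        # Drain before every trial so no warm snapshot can serve the request.
        drain_warm_pool()
        start = time.monotonic()  # clock starts at the create request
        workspace_id = create_workspace()
        wait_for_shell_prompt(workspace_id)  # blocks until the prompt is ready
        durations.append(time.monotonic() - start)  # clock stops at the prompt
    return durations
```

`time.monotonic()` is used rather than wall-clock time so that system clock adjustments during a long run cannot distort a trial.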

The harness is open. The dataset is signed. The methodology is versioned in-repo and the trials carry the commit SHA they ran against. If your numbers diverge, the diagnostic harness will tell you which percentile flipped and on which tenant size.
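
A divergence check of the kind described above might look like the sketch below. It is an assumption about how such a diagnostic could work, not the harness's actual logic: the 25% tolerance is a hypothetical threshold, the local numbers are made up for illustration, and only the overall exAI percentiles published in this report are used as the reference.

```python
from typing import Dict, List

# Published exAI cold-start percentiles from this report, in seconds.
PUBLISHED = {"p50": 0.612, "p90": 1.2, "p99": 2.6}


def diverging_percentiles(
    local: Dict[str, float],
    published: Dict[str, float],
    tolerance: float = 0.25,  # hypothetical threshold: 25% relative difference
) -> List[str]:
    """List the percentiles whose local value is off by more than `tolerance`."""
    flagged = []
    for name, reference in published.items():
        relative_gap = abs(local[name] - reference) / reference
        if relative_gap > tolerance:
            flagged.append(name)
    return flagged


# Made-up local results: the median and p90 agree, the tail does not.
local_run = {"p50": 0.640, "p90": 1.3, "p99": 4.1}
print(diverging_percentiles(local_run, PUBLISHED))  # prints ['p99']
```

Running the same comparison once per tenant size would narrow a divergence down to the tier where it appears.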

  • trials / vendor · 1,000
  • regions · 3
  • tenant sizes · 10 / 50 / 200 / 1k
  • clock · API → shell prompt
  • warm-pool drain · enforced
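
For readers re-running the trials, one way to turn raw durations into P50/P90/P99 is the nearest-rank method sketched below. The report does not state which percentile definition the harness uses, so nearest-rank is an assumption here, and the input data is synthetic.

```python
from typing import Dict, List


def nearest_rank(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    n = len(sorted_values)
    rank = (pct * n + 99) // 100  # integer ceiling of pct * n / 100
    return sorted_values[max(rank, 1) - 1]


def summarize(durations: List[float]) -> Dict[str, float]:
    """Sort once, then read off the three percentiles the report uses."""
    ordered = sorted(durations)
    return {f"p{p}": nearest_rank(ordered, p) for p in (50, 90, 99)}


# Synthetic durations of 1..100 seconds, purely to show the arithmetic.
print(summarize([float(s) for s in range(1, 101)]))
# prints {'p50': 50.0, 'p90': 90.0, 'p99': 99.0}
```

Other percentile definitions interpolate between neighbouring samples and can give slightly different tails, so compare like with like when checking against published numbers.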
workspace.cold-start · n=5,000 trials
vendor | p50 | p90 | p99
exAI | 612ms | 1.2s | 2.6s
GitHub Codespaces | 38s | 52s | 71s
Gitpod | 14s | 22s | 39s
Coder | 9.2s | 18s | 31s
Self-host fork | 45s | 62s | 92s
methodology · /benchmarks/cold-start.md · sha-256 · 4f82a1c · signed
Fig. 02 · cold-start distribution · 5 vendors · P50 · P90 · P99
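
Checking a downloaded dataset against the printed SHA-256 could be done along these lines. This is a sketch using Python's standard `hashlib`; the file path you pass in is up to you, and because the report prints only a truncated digest, the helper compares a prefix rather than the full value.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_prefix(path: Path, published_prefix: str) -> bool:
    """Compare against a truncated digest such as the one printed in this report."""
    return sha256_hex(path).startswith(published_prefix.lower())
```

A prefix match is a weak check, so compare the full digest wherever it is published. A matching hash also only shows the file is unchanged; it is not a substitute for verifying the signature itself.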
§ 03 / 06
Comparison matrix · seven metrics

Side by side.
Numbers only.

Seven metrics, chosen because platform teams asked about them in procurement reviews. exAI numbers in cobalt; competitor numbers from public benchmarks, vendor-published SLOs, and the cohort harness re-run against each vendor's offering.

Metric | exAI | GitHub Codespaces | Gitpod | Coder | Self-host fork
Composer P50 latency | 1.2s | 3.4s | n/a | n/a | 4.1s
Composer apply-rate · PRs merged with no edits | 78% | 41% | 29%
Orchestrator wall-clock · 30h migration | 29h 41m | n/a | n/a | n/a | 62h+
Audit log throughput · events/sec | 84,000 | 8,200 | 5,400 | 12,000 | 1,800
RTO · regional failover | 47s | 12m | 9m | 6m | manual
Mean prebuild warm time | 0.8s | 21s | 8s | 6s | 33s
PII scrub recall · test corpus | 0.998 | 0.91 | n/a | n/a | 0.84
Source · exAI Customer Advisory Board · q1 2026 · n=40 · 90d
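
For the PII scrub recall row, recall is the share of real PII items in the test corpus that the scrubber caught. The sketch below shows the standard definition; the counts are hypothetical and chosen only to make the arithmetic visible, since the report does not publish the corpus size.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real PII items in the corpus that the scrubber caught."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("corpus contains no PII items")
    return true_positives / total


# Hypothetical counts, not from the report: 998 items caught, 2 missed.
print(recall(998, 2))  # prints 0.998
```

Recall says nothing about false positives, so a scrubber that redacts everything would also score 1.0; precision on the same corpus is the natural companion figure to ask for.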
§ 04 / 06
AI-assisted PRs merged / engineer / month

Throughput
inverts at the M tier.

Throughput per engineer peaks at the M tier (50–500 engineers) at 31 AI-PRs per engineer per month, then declines as orgs grow, reaching 18 at XXL. The constraint is review bandwidth, not generation.

Small orgs (S) move fast because everyone reviews. The M tier optimizes — Composer's plan-diff-apply flow lands more merge-ready PRs and review queues stay under a working day.

At L and beyond, throughput per engineer inverts — review queues dilate, security and architectural reviewers become bottlenecks, and the marginal AI-PR competes with mandatory human review on every code path. The tooling didn't slow down. The org chart did.

XXL throughput (18 PRs/eng/mo) still beats every comparable public benchmark for AI-assisted development at scale — but the gap to M (31) is the most honest argument we have for keeping your platform-engineering org tight.

  • cohort N · 40 tenants
  • window · q1 2026 · 90d
  • definition · Composer-authored · merged
  • silent merges · 0
throughput · ai-prs / eng / month · by tier
tier | org size | AI-PRs / eng / mo | n
S | <50 eng | 26 | 9
M | 50–500 eng | 31 | 14
L | 500–2K eng | 24 | 9
XL | 2K–10K eng | 20 | 5
XXL | 10K+ eng | 18 | 3
Cohort median
0 PRs / eng / mo
XXL tier median
18 PRs / eng / mo
Silent merges
0
Reviewer ack-rate
0 %
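
The tier figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the published tier medians and tenant counts; the tenant-weighted mean it prints is a rough summary computed for illustration and should not be read as the cohort median, which cannot be derived from tier medians alone.

```python
# Tier medians (AI-PRs / eng / month) and tenant counts as published above.
TIERS = {
    "S": (26, 9),
    "M": (31, 14),
    "L": (24, 9),
    "XL": (20, 5),
    "XXL": (18, 3),
}

total_tenants = sum(n for _, n in TIERS.values())
peak_tier = max(TIERS, key=lambda tier: TIERS[tier][0])
# Tenant-weighted mean of tier medians: a rough summary, NOT the cohort median.
weighted_mean = sum(prs * n for prs, n in TIERS.values()) / total_tenants
xxl_vs_m = TIERS["XXL"][0] / TIERS["M"][0]

print(total_tenants, peak_tier, round(weighted_mean, 2), round(xxl_vs_m, 2))
# prints 40 M 25.95 0.58
```

The tenant counts sum to the stated cohort of 40, and XXL lands at roughly 58% of the M-tier figure.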
§ 05 / 06
3-year TCO · three scenarios

Three seat counts.
One accounting truth.

Total cost of ownership over 36 months — workspace compute, AI tokens, support, and ops effort priced in. Inputs and assumptions are versioned alongside the dataset so your finance team can rebuild the model from primary numbers.

Scenario 01 · 3-year · usd
50 seats · seed-stage platform
$184K
· Workspace compute · $71K
· AI tokens · routed · $58K
· Support · 9×5 · $22K
· Ops effort · 0.4 FTE · $33K
Scenario 02 · 3-year · usd
500 seats · platform org
$1.24M
· Workspace compute · $481K
· AI tokens · routed · $402K
· Support · 24×7 · $148K
· Ops effort · 1.8 FTE · $209K
Scenario 03 · 3-year · usd
5,000 seats · global enterprise
$9.7M
· Workspace compute · $3.8M
· AI tokens · routed · $3.1M
· Support · TAM + 24×7 · $1.2M
· Ops effort · 6.5 FTE · $1.6M
Methodology

Benchmarked against an equivalent stack of GitHub Codespaces + GitHub Copilot Enterprise + Sentry + Datadog, sized to match each scenario's seat count. Compute prices use list rates from each vendor as of Q1 2026; AI token costs use the cohort-median spend per merged PR. Ops effort is priced at $180k/yr fully-loaded. Full inputs at /benchmarks/tco.csv.
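
To rebuild the scenario totals from the listed line items, a finance team could start from something like the sketch below. It only sums the four published line items per scenario and derives a per-seat monthly figure; the dictionary keys are illustrative names, and the 5,000-seat inputs are rounded to the nearest $0.1M in the source, so that row is approximate.

```python
MONTHS = 36

# Line items per scenario in USD thousands, copied from the three scenarios above.
SCENARIOS = {
    50: {"compute": 71, "ai_tokens": 58, "support": 22, "ops": 33},
    500: {"compute": 481, "ai_tokens": 402, "support": 148, "ops": 209},
    5000: {"compute": 3800, "ai_tokens": 3100, "support": 1200, "ops": 1600},
}


def total_k(items: dict) -> int:
    """Three-year total in USD thousands."""
    return sum(items.values())


def per_seat_month(seats: int, items: dict) -> float:
    """Three-year total spread evenly over seats and months, in USD."""
    return total_k(items) * 1000 / seats / MONTHS


for seats, items in SCENARIOS.items():
    print(f"{seats} seats: ${total_k(items)}K total, "
          f"${per_seat_month(seats, items):.2f} per seat per month")
```

The totals come out at $184K, $1,240K, and $9,700K, which works out to about $102, $69, and $54 per seat per month respectively.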

§ 06 / 06
Reproducer harness · open · MIT

Run them yourself.
Then talk to us.

The harness is open to clone, the dataset is signed, and the methodology is versioned. We don't ship benchmark numbers we wouldn't hand to a skeptical platform engineer with a stopwatch. If your results diverge, file an issue.

Source · exAI Customer Advisory Board · Q1 2026 cohort (n=40) · 90-day rolling · Verifiable at /benchmarks/raw