SPAR Research
Close, But How Close?
Quantifying the US-China AI Capabilities Gap Across Multiple Evaluation Sources
1.4% – 6.8%
Frontier capability gap
Varies nearly 5x by methodology
+5.0 pp
Mean inflation of self-reported scores
95% CI: [3.51, 6.55]
p = 0.29
US vs China bias difference
Not significant
447 models · 194 verified pairs · 3 data sources · Updated 2026-03-20
Section 1
The Gap
The US maintains a lead in frontier AI capabilities, but the gap has narrowed substantially since 2023. Crucially, the size of the measured gap varies nearly fivefold depending on which metric you use.
Each point is a model. Lines track the running frontier (best score) per region. Stars mark frontier-advancing models.
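For readers who want the mechanics: the frontier lines reduce to a per-region running maximum. A minimal sketch in pandas, with illustrative values and assumed column names (release_date, region, score), not the project's actual schema:

```python
import pandas as pd

# Illustrative rows only, not real data points from the dataset.
models = pd.DataFrame({
    "release_date": pd.to_datetime(["2023-03-14", "2023-11-21", "2024-05-13"]),
    "region": ["US", "CN", "US"],
    "score": [86.4, 84.1, 88.7],
}).sort_values("release_date")

# Running frontier: best score achieved in each region up to each date.
models["frontier"] = models.groupby("region")["score"].cummax()

# A model is frontier-advancing if it beats every earlier score in its region.
prev_best = models.groupby("region")["frontier"].shift()
models["advances_frontier"] = models["score"] > prev_best.fillna(float("-inf"))
```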
Three independent metrics tracking the frontier gap. Dotted lines show OLS trends. Divergence reflects measurement uncertainty.
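The dotted trend lines are ordinary least-squares fits of the gap against time. A sketch of one way to compute such a slope, assuming dates arrive as numpy datetime64 values:

```python
import numpy as np

def ols_trend(dates, gaps):
    """Fit gap = slope * years + intercept by least squares.
    dates: np.datetime64 array; gaps: gap values in percentage points."""
    t = (dates - dates.min()) / np.timedelta64(365, "D")  # years since start
    slope, intercept = np.polyfit(t, gaps, 1)
    return slope, intercept  # slope: pp change per year
```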
Each bar shows how many days elapsed between a US frontier advance and the first Chinese model reaching the same capability level.
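One way to compute these lags, as a hedged sketch: for each US frontier advance, scan for the first later Chinese model at or above that score. The (date, score) tuple format is an assumption for illustration:

```python
from datetime import date

def days_to_parity(us_advances, cn_models):
    """For each US frontier advance (date, score), days until the first
    later Chinese model matches or exceeds that score (None if never)."""
    lags = []
    for d, s in us_advances:
        caught_up = [cd for cd, cs in cn_models if cd > d and cs >= s]
        lags.append((min(caught_up) - d).days if caught_up else None)
    return lags

# Illustrative call with made-up points, not real data:
# days_to_parity([(date(2024, 5, 13), 88.7)], [(date(2024, 9, 19), 89.1)])
```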
Section 2
Can We Trust the Numbers?
Labs self-report benchmark scores that are, on average, 5.0 percentage points higher than independent evaluations. This bias is statistically indistinguishable between US and Chinese labs.
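The p = 0.29 headline could be produced by any standard two-sample test; the specific test is not stated here, so the following is an illustrative permutation test on the per-pair reporting gaps, not necessarily the study's exact procedure:

```python
import numpy as np

def permutation_test(us_gaps, cn_gaps, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in mean reporting gap
    between US and Chinese labs. Returns (observed difference, p-value)."""
    rng = np.random.default_rng(seed)
    us_gaps, cn_gaps = np.asarray(us_gaps, float), np.asarray(cn_gaps, float)
    observed = us_gaps.mean() - cn_gaps.mean()
    pooled = np.concatenate([us_gaps, cn_gaps])
    k = len(us_gaps)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)  # relabel pairs at random under the null
        diffs[i] = pooled[:k].mean() - pooled[k:].mean()
    return observed, float((np.abs(diffs) >= abs(observed)).mean())
```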
Each point is a verified model-benchmark pair. Points above the diagonal = self-reported exceeds third-party. Shaded zone shows over-reporting region.
The distribution of all reporting gaps. The rightward skew and positive mean confirm systematic over-reporting.
Point estimates with 95% bootstrap CIs per benchmark. Colored markers indicate statistical significance.
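The intervals shown are percentile bootstrap CIs. A minimal sketch of how such an interval can be computed for one benchmark's reporting gaps (function and argument names are illustrative):

```python
import numpy as np

def bootstrap_ci(gaps, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean self-report gap (in pp)
    on a single benchmark."""
    rng = np.random.default_rng(seed)
    gaps = np.asarray(gaps, float)
    # Resample with replacement and take the mean of each resample.
    resamples = rng.choice(gaps, size=(n_boot, len(gaps)), replace=True)
    means = resamples.mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return gaps.mean(), (float(lo), float(hi))
```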
Left: Per-model gap by region — the narrative that Chinese labs are uniquely untrustworthy is not supported. Right: Does model capability predict reporting accuracy?
Section 3
Why Measurement Is Hard
The AI evaluation landscape is sparse. Most model-benchmark pairs lack independent verification. We're making policy decisions with far less data than commonly assumed.
Each column is a model. Each row is a benchmark. Most of this matrix is empty — that's the problem.
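Sparsity here is just the share of empty cells. A sketch of how coverage can be measured from a long-format score table, assuming columns named model, benchmark, and score (not the dataset's actual schema):

```python
import pandas as pd

def matrix_coverage(scores: pd.DataFrame) -> float:
    """Fraction of model-benchmark cells with at least one reported score,
    over the models and benchmarks that appear in the table."""
    grid = scores.pivot_table(index="model", columns="benchmark",
                              values="score", aggfunc="first")
    return float(grid.notna().to_numpy().mean())
```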
447
Models tracked
291
Benchmarks observed
194
Verified comparable pairs
15
Benchmarks with cross-validation
Mean discrepancy by developer with 95% CI. Diamond points show individual benchmark pairs. Toggle sort order.
Section 4
What It Costs
Cost leadership varies by capability tier. At the frontier, only US models compete. At near-frontier levels, Chinese models often offer better price-performance.
Lower-left is better (cheaper + more capable). Lines connect Pareto-optimal models per region. Frontier models labeled.
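The Pareto lines connect models that no same-region alternative dominates, i.e. nothing is simultaneously cheaper and more capable. A minimal sketch of that selection, assuming each model is a dict with hypothetical cost and score keys:

```python
def pareto_frontier(models):
    """Keep each model only if no other model is at most as expensive
    and strictly more capable (minimize cost, maximize score)."""
    # Sort by ascending cost, breaking cost ties by descending score.
    ordered = sorted(models, key=lambda m: (m["cost"], -m["score"]))
    frontier, best_score = [], float("-inf")
    for m in ordered:
        if m["score"] > best_score:  # strictly beats every cheaper model
            frontier.append(m)
            best_score = m["score"]
    return frontier
```

Sorting by cost and keeping only strict score improvements is the standard single-pass skyline construction for two objectives.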
Section 5
So What?
The gap is real but modest, uncertain, and more nuanced than headlines suggest.
For AI Safety
Narrow gaps (1-7%) may intensify competitive dynamics. Game-theoretic models suggest teams near parity reduce safety investment. The uncertainty itself is policy-relevant.
For Policymakers
Confident claims about the gap in either direction are not well-supported. Export controls and compute restrictions premised on specific gap estimates risk unintended consequences.
For Open-Weight Governance
Chinese labs releasing open-weight models means capability becomes globally accessible regardless of which "country leads." The relevant question shifts to safeguards, not origin.
For Practitioners
Self-reported scores are informative starting points but should be supplemented with independent evaluation. No region is uniquely untrustworthy.