SPAR Research

Close, But How Close?

Quantifying the US-China AI Capabilities Gap Across Multiple Evaluation Sources

1.4% – 6.8%

Frontier capability gap

Varies nearly 5x by methodology

+5.0 pp

Self-reported over-reporting

95% CI: [3.51, 6.55]

p = 0.29

US vs China bias difference

Not significant

447 models · 194 verified pairs · 3 data sources · Updated 2026-03-20

Section 1

The Gap

The US maintains a lead in frontier AI capabilities, but the gap has narrowed substantially since 2023. Crucially, the measured gap varies nearly 5x depending on which metric you use.

Each point is a model. Lines track the running frontier (best score) per region. Stars mark frontier-advancing models.
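The running frontier described above is a cumulative maximum over release-ordered scores per region. A minimal sketch, using hypothetical (release date, region, score) records rather than the real dataset:

```python
# Sketch of the running-frontier computation. The records below are
# illustrative stand-ins, not actual benchmark data.
from itertools import accumulate

models = [
    ("2023-03", "US", 85.2),
    ("2023-07", "CN", 79.1),
    ("2024-01", "US", 88.0),
    ("2024-05", "CN", 86.5),
]

def running_frontier(records, region):
    """Best score seen so far for a region, in release order."""
    scores = [s for (_, r, s) in sorted(records) if r == region]
    return list(accumulate(scores, max))

print(running_frontier(models, "US"))  # cumulative best per US release
```

A "frontier-advancing" model is then simply one whose score strictly exceeds the previous value of this running maximum.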

Three independent metrics tracking the frontier gap. Dotted lines show OLS trends. Divergence reflects measurement uncertainty.

Each bar shows how many days elapsed between a US frontier advance and China reaching the same capability level.
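The lag metric can be sketched as: for each US frontier advance, find the first subsequent Chinese release at or above that score. Dates and scores below are made up for illustration.

```python
# Illustrative lag-days computation; data are hypothetical.
from datetime import date

us_advances = [(date(2023, 3, 14), 86.4), (date(2024, 2, 1), 90.1)]
cn_releases = [(date(2023, 11, 6), 87.0), (date(2024, 7, 20), 90.5)]

def days_to_match(advance_date, advance_score, releases):
    """Days until the first release at or above advance_score, else None."""
    for d, s in sorted(releases):
        if d >= advance_date and s >= advance_score:
            return (d - advance_date).days
    return None  # capability level not yet matched

lags = [days_to_match(d, s, cn_releases) for d, s in us_advances]
print(lags)
```

Unmatched advances (returning `None`) are worth tracking separately: a chart of lag days silently drops capability levels the other region has never reached.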

Section 2

Can We Trust the Numbers?

Labs self-report benchmark scores that are, on average, 5.0 percentage points higher than independent evaluations. This bias is statistically indistinguishable between US and Chinese labs.

Each point is a verified model-benchmark pair. Points above the diagonal = self-reported exceeds third-party. Shaded zone shows over-reporting region.

The distribution of all reporting gaps. The rightward skew and positive mean confirm systematic over-reporting.

Point estimates with 95% bootstrap CIs per benchmark. Colored markers indicate statistical significance.
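A percentile-bootstrap CI of the kind shown here can be sketched as follows. The sample is simulated with a made-up mean and spread, not the real 194-pair dataset.

```python
# Minimal percentile-bootstrap sketch for the mean reporting gap,
# using a simulated sample of (self_reported - third_party) differences
# in percentage points.
import random

random.seed(0)
gaps = [random.gauss(5.0, 4.0) for _ in range(194)]  # stand-in sample

def bootstrap_ci(data, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the mean."""
    means = sorted(
        sum(random.choices(data, k=len(data))) / len(data)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2))]
    return lo, hi

lo, hi = bootstrap_ci(gaps)
print(f"mean = {sum(gaps) / len(gaps):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

A CI that excludes zero, as the headline [3.51, 6.55] interval does, is what licenses the claim of systematic over-reporting.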

Left: Per-model gap by region; the narrative that Chinese labs are uniquely untrustworthy is not supported. Right: Does model capability predict reporting accuracy?
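The US-vs-China comparison behind the p = 0.29 result can be approached with a two-sample permutation test on per-model gaps. A minimal sketch with small made-up samples (the real analysis uses the verified pairs):

```python
# Two-sided permutation test for a difference in mean reporting gap
# between two groups. Samples below are hypothetical.
import random

random.seed(1)
us_gaps = [4.1, 6.3, 2.0, 7.5, 5.2, 3.8]
cn_gaps = [5.0, 4.4, 6.1, 3.3, 7.0, 5.9]

def permutation_p(a, b, n_perm=20_000):
    """P(|mean diff| >= observed) under random label shuffling."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(a) - sum(pb) / len(b)) >= observed:
            count += 1
    return count / n_perm

print(f"p = {permutation_p(us_gaps, cn_gaps):.3f}")
```

Shuffling region labels and recomputing the difference asks directly: how often would a gap this large arise if region carried no information?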

Section 3

Why Measurement Is Hard

The AI evaluation landscape is sparse. Most model-benchmark pairs lack independent verification. We're making policy decisions with far less data than commonly assumed.

Each column is a model. Each row is a benchmark. Most of this matrix is empty — that's the problem.

447

Models tracked

291

Benchmarks observed

194

Verified comparable pairs

15

Benchmarks with cross-validation
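The sparsity point can be made concrete by dividing filled cells by the full matrix size. Treating the 194 verified pairs as the filled cells is an illustrative simplification (observed-but-unverified scores also populate the matrix):

```python
# Coverage sketch using the headline counts above.
n_models, n_benchmarks = 447, 291
verified_pairs = 194

total_cells = n_models * n_benchmarks
coverage = verified_pairs / total_cells
print(f"{verified_pairs} / {total_cells} cells = {coverage:.4%} verified")
```

Well under one percent of the matrix is independently verified, which is the quantitative core of the "far less data than commonly assumed" claim.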

Mean discrepancy by developer with 95% CI. Diamond points show individual benchmark pairs. Toggle sort order.

Section 4

What It Costs

Cost leadership varies by capability tier. At the frontier, only US models compete. At near-frontier levels, Chinese models often offer better price-performance.

Lower-left is better (cheaper + more capable). Lines connect Pareto-optimal models per region. Frontier models labeled.
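The Pareto filter behind this chart can be sketched directly: a model stays on the frontier if no other model is at least as cheap and at least as capable. Names and (price, score) values below are hypothetical.

```python
# Pareto-optimality sketch on (price, score) points; lower price and
# higher score are both better. Data are made up.
models = {
    "A": (30.0, 92.0),
    "B": (3.0, 88.0),
    "C": (10.0, 85.0),  # dominated by B (pricier and weaker)
    "D": (1.0, 80.0),
}

def pareto_optimal(points):
    """Keep models not dominated on (lower price, higher score)."""
    return {
        name for name, (p, s) in points.items()
        if not any(
            p2 <= p and s2 >= s and (p2, s2) != (p, s)
            for p2, s2 in points.values()
        )
    }

print(sorted(pareto_optimal(models)))
```

Running the filter per region and connecting the survivors in price order yields the per-region frontier lines in the chart.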

Section 5

So What?

The gap is real but modest, uncertain, and more nuanced than headlines suggest.

For AI Safety

Narrow gaps (1-7%) may intensify competitive dynamics. Game-theoretic models suggest teams near parity reduce safety investment. The uncertainty itself is policy-relevant.

For Policymakers

Confident claims about the gap in either direction are not well-supported. Export controls and compute restrictions premised on specific gap estimates risk unintended consequences.

For Open-Weight Governance

Chinese labs releasing open-weight models means capability becomes globally accessible regardless of which "country leads." The relevant question shifts to safeguards, not origin.

For Practitioners

Self-reported scores are informative starting points but should be supplemented with independent evaluation. No region is uniquely untrustworthy.