About — Close, But How Close?

Close, But How Close? is a research project by Sam Donahue, a SPAR Fellow at the Kairos Foundation. It systematically quantifies the gap between the benchmark scores that AI labs self-report and the scores measured by independent third-party evaluations, with a focus on comparing US and Chinese frontier AI capabilities.

What SPAR Is

SPAR (Supervised Program for Alignment Research) is a research fellowship program that supports rigorous empirical work on AI safety and governance. This project addresses a core challenge in AI policy: the lack of reliable, cross-validated data on comparative AI capabilities.

Key Findings

The US-China frontier AI capability gap is 1.4% – 6.8% depending on methodology (as of February 2026)
Self-reported benchmark scores are biased upward by 5.0 percentage points on average
The difference in this bias between US and Chinese labs is not statistically significant (Mann-Whitney U test, p = 0.13)
The gap has narrowed substantially since 2023, but recent trend direction is uncertain
Measurement is harder than commonly acknowledged — only 15 of 400+ benchmarks have sufficient cross-validated data
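
As a rough illustration of the bias comparison in the findings above, the following sketch computes a mean self-report bias and a two-sided Mann-Whitney U test in pure Python. The gap values are hypothetical toy numbers, not project data, and the function name `mann_whitney_u` is an assumption made for this example. The p-value uses the normal approximation without tie or continuity correction, which may differ from whatever the project's analysis code does.

```python
import math
from statistics import mean

def mann_whitney_u(a, b):
    """Return (U, two-sided p) using the normal approximation.

    Simplification: no tie or continuity correction, so treat the
    p-value as rough for small or heavily tied samples.
    """
    n1, n2 = len(a), len(b)
    # U counts pairs where the value from `a` exceeds the value from `b`;
    # ties count as half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p

# Hypothetical gaps (self-reported minus third-party score), in percentage points.
us_gaps = [4.0, 6.5, 3.0, 7.5, 5.0]
cn_gaps = [5.5, 2.0, 8.0, 4.5, 6.0]

overall_bias = mean(us_gaps + cn_gaps)
u_stat, p_value = mann_whitney_u(us_gaps, cn_gaps)
print(f"mean bias: {overall_bias:+.1f} pp, U = {u_stat}, p = {p_value:.2f}")
```

A large p-value here only means the toy samples give no evidence of a difference between the two groups. For real data, a library routine such as `scipy.stats.mannwhitneyu` would be the more usual choice.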

Methodology at a Glance

The analysis pipeline ingests data from three independent sources, matches models and benchmarks across sources using LLM-assisted consensus, and produces verified comparable pairs. The full methodology is available on the Methodology page.
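
The consensus step is not specified in detail on this page. As one possible reading, here is a minimal sketch that assumes each matcher (for example, a separate LLM call) proposes a canonical ID for a source-specific model name, and a match is accepted only when enough matchers agree. The function name `consensus_match`, the `min_agreement` threshold, and the placeholder IDs are all hypothetical and do not describe the project's actual pipeline.

```python
from collections import Counter

def consensus_match(votes, min_agreement=2):
    """Return the ID that reached quorum, or None if there is no clear winner.

    `votes` holds one proposed canonical ID per matcher; None means abstain.
    """
    counts = Counter(v for v in votes if v is not None)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    top_id, top_n = ranked[0]
    # Reject ties between the two leading candidates.
    if len(ranked) > 1 and ranked[1][1] == top_n:
        return None
    return top_id if top_n >= min_agreement else None

# Placeholder IDs, not real model names.
print(consensus_match(["model-a", "model-a", "model-b"]))  # quorum reached
print(consensus_match(["model-a", "model-b", None]))       # no quorum
```

Returning None for ties and sub-quorum results keeps ambiguous pairs out of the verified set, at the cost of lower coverage.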

449 models tracked
350 benchmarks
194 verified pairs

Why the Gap Varies 5x

The measured gap depends heavily on which metric you use. Three independent metrics show different gap sizes because they measure different things:

AAII (~6.8% gap) — Artificial Analysis Intelligence Index. Weighted composite of 14 benchmarks with broad coverage (440 models). Proprietary weighting amplifies differences at the frontier.
ECI (~3-5% gap) — Epoch Capabilities Index. IRT-based latent ability estimate across 40+ benchmarks. Sparser coverage but methodologically rigorous.
IRT (~1.4% gap) — Our IRT latent score from third-party evaluations only. Narrowest gap because it filters out self-reported inflation entirely.
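
For readers unfamiliar with IRT (item response theory), the sketch below shows the general idea of a latent ability score using the simplest variant, the one-parameter (Rasch) model, with a grid-search maximum-likelihood estimate. This is an assumption chosen for illustration: the page does not say which IRT model ECI or this project uses, and the function names, item difficulties, and responses are hypothetical.

```python
import math

def p_correct(theta, difficulty):
    """Rasch (1PL) model: probability that ability `theta` solves an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_theta(responses, difficulties):
    """Grid-search maximum-likelihood estimate of theta from 0/1 responses."""
    grid = [i / 100 for i in range(-400, 401)]  # theta from -4.00 to 4.00

    def log_lik(theta):
        total = 0.0
        for solved, b in zip(responses, difficulties):
            p = p_correct(theta, b)
            total += math.log(p) if solved else math.log(1.0 - p)
        return total

    return max(grid, key=log_lik)

# Hypothetical item difficulties and pass/fail outcomes for two models.
difficulties = [-1.0, -0.5, 0.5, 1.0]
weaker = estimate_theta([1, 0, 0, 0], difficulties)
stronger = estimate_theta([1, 1, 1, 0], difficulties)
print(weaker, stronger)
```

The point of the latent score is that it weights each result by how hard the item is, so two models can be compared even when they were evaluated on partly different benchmarks.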

None of these is "right" — they reflect different tradeoffs between coverage, independence, and methodology. The range itself is the finding.
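
The "5x" in the section title follows directly from the figures quoted above. A minimal sketch of the arithmetic, using only those quoted values (the dictionary layout is just for illustration):

```python
# Gap figures quoted above, in percent; ECI is given as a range.
gaps = {"AAII": (6.8, 6.8), "ECI": (3.0, 5.0), "IRT": (1.4, 1.4)}

narrowest = min(low for low, _ in gaps.values())
widest = max(high for _, high in gaps.values())
ratio = widest / narrowest

print(f"range: {narrowest}% to {widest}%, ratio: {ratio:.1f}x")
```

The widest estimate is a little under five times the narrowest, which the title rounds to 5x.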

Citation

If you use this data or findings in your work, please cite:

@techreport{donahue2026close,
  title={Close, But How Close? Quantifying the US-China AI
         Capabilities Gap Across Multiple Evaluation Sources},
  author={Donahue, Sam},
  year={2026},
  institution={Kairos Foundation / SPAR}
}

Data & Code

All data powering this website is derived from three public sources: Artificial Analysis, Epoch AI, and LLM Stats. The matching pipeline and analysis code are available on GitHub.

Contact

For questions, feedback, or collaboration, reach out to Sam Donahue via the Kairos Foundation.