About This Project
Close, But How Close? is a research project by Sam Donahue, a SPAR Fellow at the Kairos Foundation. It systematically quantifies the gap between the benchmark scores that AI labs report for their own models and the scores measured independently by third-party evaluators, with a focus on comparing US and Chinese frontier AI capabilities.
What SPAR Is
SPAR (Supervised Program for Alignment Research) is a research fellowship program that supports rigorous empirical work on AI safety and governance. This project addresses a core challenge in AI policy: the lack of reliable, cross-validated data on comparative AI capabilities.
Key Findings
- The US-China frontier AI capability gap is 1.4% – 6.8% depending on methodology (as of February 2026)
- Self-reported benchmark scores run 5.0 percentage points higher than third-party measurements on average
- This bias is statistically indistinguishable between US and Chinese labs (Mann-Whitney p = 0.13)
- The gap has narrowed substantially since 2023, but recent trend direction is uncertain
- Measurement is harder than commonly acknowledged — only 15 of 400+ benchmarks have sufficient cross-validated data
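The bias and significance figures above can be illustrated with a minimal sketch. It assumes bias is defined as self-reported score minus third-party score for the same model and benchmark, and it uses a hand-rolled Mann-Whitney U test with a normal approximation. The function names and the numbers in `us_pairs` and `cn_pairs` are made up for illustration and are not the project's data or code.

```python
from math import erfc, sqrt
from statistics import mean


def bias_points(pairs):
    """Per-pair bias in percentage points: self-reported minus third-party."""
    return [self_reported - third_party for self_reported, third_party in pairs]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Ties receive midranks; the variance is not tie-corrected, and for
    very small samples an exact test would be preferable.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return u, p_value


# Hypothetical (self_reported, third_party) score pairs, NOT real data.
us_pairs = [(90.0, 85.0), (80.0, 76.0), (70.0, 64.0)]
cn_pairs = [(88.0, 81.0), (75.0, 72.0), (60.0, 52.0)]

us_bias = bias_points(us_pairs)
cn_bias = bias_points(cn_pairs)
overall_bias = mean(us_bias + cn_bias)
u_stat, p_value = mann_whitney_u(us_bias, cn_bias)

print(f"mean bias: {overall_bias:+.1f} pp, U = {u_stat}, p = {p_value:.2f}")
```

A large p-value here would mean the two groups' bias distributions cannot be told apart at conventional significance levels, which is the sense in which the finding above describes the bias as statistically indistinguishable.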
Methodology at a Glance
The analysis pipeline ingests data from three independent sources, matches models and benchmarks across sources using LLM-assisted consensus, and produces verified comparable pairs. The full methodology is available on the Methodology page.
- 449 models tracked
- 350 benchmarks
- 194 verified pairs
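One way to picture the consensus step is as a strict majority vote over candidate matches. The sketch below assumes several independent matchers (for example, separate LLM calls) each propose a canonical identifier for a raw model name, and a match is accepted only when more than half of them agree. The function name, the voting rule, and the model names are hypothetical; the project's actual pipeline is described on the Methodology page.

```python
from collections import Counter


def consensus_match(votes):
    """Return the identifier backed by a strict majority of votes, else None.

    A vote of None means that matcher declined to propose a match.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    candidate, n = counts.most_common(1)[0]
    return candidate if 2 * n > len(votes) else None


# Hypothetical proposals from three matchers for two raw names.
proposals = {
    "Model A (v2, Aug)": ["model-a-v2", "model-a-v2", "model-a-mini"],
    "Model B 72B": ["model-b-72b", None, "model-b-32b"],
}

# Keep only the names on which the matchers reached consensus.
verified = {
    raw: match
    for raw, votes in proposals.items()
    if (match := consensus_match(votes)) is not None
}
print(verified)
```

Requiring a strict majority means ambiguous names are dropped rather than guessed, which trades coverage for precision in the resulting set of comparable pairs.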
Why the Gap Varies 5x
The measured gap depends heavily on which metric you use. Three independent metrics show different gap sizes because they measure different things:
- AAII (~6.8% gap) — Artificial Analysis Intelligence Index. Weighted composite of 14 benchmarks with broad coverage (440 models). Proprietary weighting amplifies differences at the frontier.
- ECI (~3-5% gap) — Epoch Capabilities Index. IRT-based latent ability estimate across 40+ benchmarks. Sparser coverage but methodologically rigorous.
- IRT (~1.4% gap) — Our IRT latent score from third-party evaluations only. Narrowest gap because it filters out self-reported inflation entirely.
None of these is "right" — they reflect different tradeoffs between coverage, independence, and methodology. The range itself is the finding.
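The "5x" in the section title follows from the endpoints of the range. As a quick check, using the approximate gap values quoted above (the variable names are only illustrative):

```python
# Approximate gap sizes quoted above, in percent.
widest_gap = 6.8    # AAII
narrowest_gap = 1.4  # IRT, third-party evaluations only

ratio = widest_gap / narrowest_gap
print(round(ratio, 1))  # → 4.9
```

The widest estimate is therefore roughly five times the narrowest, with the ECI estimate falling in between.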
Citation
If you use this data or findings in your work, please cite:
@article{donahue2026close,
  title={Close, But How Close? Quantifying the US-China AI Capabilities Gap Across Multiple Evaluation Sources},
  author={Donahue, Sam},
  year={2026},
  institution={Kairos Foundation / SPAR}
}
Data & Code
All data powering this website is derived from three public sources: Artificial Analysis, Epoch AI, and LLM Stats. The matching pipeline and analysis code are available on GitHub.
Contact
For questions, feedback, or collaboration, reach out to Sam Donahue via the Kairos Foundation.