Methodology
How We Built This
A 5-phase pipeline ingests data from 3 independent sources, matches models and benchmarks using LLM-assisted consensus, and produces verified comparable pairs for analysis. Most data is lost at a single bottleneck: limited cross-source overlap.
Each link's thickness is proportional to the number of scores flowing through it. The thin stream reaching "Matched Pairs" is the core bottleneck.
What Is a Matched Pair?
A matched pair exists when the same model is tested on the same benchmark by both a developer (self-reported) and an independent evaluator (third-party). Comparing the two scores reveals reporting bias.
MiMo-V2-Flash
Benchmark: GPQA Diamond
83.7%
Source: Developer paper / LLM Stats
MiMo-V2-Flash
Benchmark: GPQA Diamond
83.5%
Source: Artificial Analysis
This pair's +0.2 pp difference is typical of the overall mean over-reporting bias we observe.
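In code, a matched pair reduces to a simple comparison of two score records that share a model and benchmark. A minimal sketch of that comparison, using the example above (the `Score` fields are illustrative, not the pipeline's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Score:
    model: str
    benchmark: str
    value: float      # score in percentage points
    source: str
    self_reported: bool

def pair_delta(dev: Score, indep: Score) -> float:
    """Over-reporting delta in percentage points: positive means the
    developer's self-reported score exceeds the independent one."""
    assert dev.model == indep.model and dev.benchmark == indep.benchmark
    return dev.value - indep.value

dev = Score("MiMo-V2-Flash", "GPQA Diamond", 83.7, "LLM Stats", True)
indep = Score("MiMo-V2-Flash", "GPQA Diamond", 83.5, "Artificial Analysis", False)
print(round(pair_delta(dev, indep), 1))  # 0.2
```

The sign convention matters: a positive delta is evidence of over-reporting, a negative one of under-reporting.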
The dominant loss is structural: most benchmarks are only measured by one source.
Gray background = total pairs. Green = verified pairs that passed model, benchmark, and methodology checks.
Data Sources
Artificial Analysis
Third-party standardized API evaluations. Independent, reproducible testing environment.
5,013 scores
Epoch AI
Mixed-source benchmark database. 149 external evaluations + 157 developer-reported scores.
44 benchmarks · 306 scores
LLM Stats
Aggregates self-reported scores from developer papers and announcements.
1,490 scores
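The bottleneck noted above is, at its core, a set intersection on (model, benchmark) keys: a score only becomes part of a matched pair if another source reports the same key. A minimal sketch with toy records (real source tables hold thousands of scores, and the pipeline's matching is LLM-assisted rather than an exact key join):

```python
def matched_pairs(self_reported, third_party):
    """Keep only (model, benchmark) keys present in both sources."""
    by_key = {(s["model"], s["benchmark"]): s for s in third_party}
    return [
        (s, by_key[(s["model"], s["benchmark"])])
        for s in self_reported
        if (s["model"], s["benchmark"]) in by_key
    ]

self_reported = [
    {"model": "m1", "benchmark": "GPQA Diamond", "score": 83.7},
    {"model": "m1", "benchmark": "OnlyDevBench", "score": 90.0},
]
third_party = [{"model": "m1", "benchmark": "GPQA Diamond", "score": 83.5}]
print(len(matched_pairs(self_reported, third_party)))  # 1
```

Here half the self-reported scores fall out of the intersection immediately, which is the structural loss the funnel visualizes.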
Model Matching
Models appear under different names across sources (e.g., "GPT-5 (high)" in Artificial Analysis vs "gpt-5-high" in LLM Stats). We use a 3-phase matching pipeline:
- Organization resolution — LLM-assisted clustering of organization name variants ("Google DeepMind" = "Google" = "DeepMind")
- Per-org model grouping — Within each org, an LLM groups source names into canonical model identities, assigning confidence scores and hierarchy levels (organization → family → variant → configuration)
- Consensus verification — Low-confidence matches re-evaluated with 3-pass LLM consensus at varying temperatures [1.0, 1.2, 1.4]
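The consensus step can be sketched as a majority vote over independent LLM passes, one per temperature. This is a simplified illustration: `query_llm` is a hypothetical stand-in for the actual model call, and the real prompt and response parsing are more involved.

```python
from collections import Counter

TEMPERATURES = [1.0, 1.2, 1.4]

def consensus_match(name_a: str, name_b: str, query_llm) -> bool:
    """Re-evaluate a low-confidence match with one LLM pass per
    temperature and accept the majority verdict. `query_llm` must
    return the string "same" or "different"."""
    votes = Counter(
        query_llm(
            f"Are '{name_a}' and '{name_b}' the same model? "
            "Answer 'same' or 'different'.",
            temperature=t,
        )
        for t in TEMPERATURES
    )
    return votes.most_common(1)[0][0] == "same"

# Toy stub that always answers "same", so consensus is trivially "same".
print(consensus_match("GPT-5 (high)", "gpt-5-high",
                      lambda prompt, temperature: "same"))  # True
```

With three binary votes a tie is impossible, so the majority verdict is always well defined.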
Comparability Assessment
For each (model, benchmark) pair with both self-reported and third-party scores, three checks determine quality tier:
- Model match — Are we comparing the same model variant? (deterministic, based on Phase 2 confidence)
- Benchmark match — Same benchmark version, subset, and year? (deterministic, based on Phase 3 confidence)
- Methodology match — Same evaluation setup (prompting, scoring, tools)? (hybrid: deterministic when metadata available, LLM consensus when not)
Pairs are assigned a tier of Gold (all three checks match), Silver (uncertain on one), or Bronze (explicitly different methodology). Pairs that fail the model or benchmark check are excluded from analysis entirely.
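The tiering rules above can be written as a small decision function. This sketch simplifies by treating the model and benchmark checks as already-resolved booleans and routing all uncertainty through the methodology check; the labels are illustrative, not the pipeline's actual field values.

```python
def assign_tier(model_match: bool, benchmark_match: bool,
                methodology_match: str) -> str:
    """Tier a (model, benchmark) pair. `methodology_match` is one of
    "same", "uncertain", or "different"."""
    if not (model_match and benchmark_match):
        return "Excluded"   # different model or benchmark: discarded
    if methodology_match == "same":
        return "Gold"       # all three checks pass
    if methodology_match == "uncertain":
        return "Silver"     # uncertain on one check
    return "Bronze"         # explicitly different methodology

print(assign_tier(True, True, "same"))       # Gold
print(assign_tier(True, True, "different"))  # Bronze
print(assign_tier(False, True, "same"))      # Excluded
```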
Known Limitations
- Coverage is sparse — only 15 benchmarks have sufficient cross-validated data
- Cannot detect benchmaxing (training on benchmark data) or distillation
- Single LLM verifier (Gemini 3 Flash) could introduce systematic bias
- Snapshot in time — results reflect model landscape as of 2026-03-20
- Selection bias toward models with both self-reported and third-party scores
Methodology Audit
We conducted a comprehensive audit identifying 27 methodological issues across severity levels. High-severity findings include variant selection bias toward reasoning models, heuristic score normalization, and arbitrary outlier thresholds. Full audit available in the research paper appendix.