Methodology

How We Built This

A 5-phase pipeline ingests data from 3 independent sources, matches models and benchmarks using LLM-assisted consensus, and produces verified comparable pairs for analysis. Most data is lost at a single bottleneck: cross-source overlap.

Each link's thickness is proportional to the number of scores flowing through it. The thin stream reaching "Matched Pairs" is the core bottleneck.

What Is a Matched Pair?

A matched pair exists when the same model is tested on the same benchmark by both a developer (self-reported) and an independent evaluator (third-party). Comparing the two scores reveals reporting bias.

  • Self-reported: MiMo-V2-Flash, GPQA Diamond, 83.7% (source: developer paper / LLM Stats)
  • Third-party: MiMo-V2-Flash, GPQA Diamond, 83.5% (source: Artificial Analysis)
  • Gap: +0.2 pp

This pair's +0.2 pp difference is close to the mean over-reporting bias we observe across all matched pairs.
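The gap is simply the self-reported score minus the third-party score, in percentage points. A minimal sketch (the function name is illustrative, not from our codebase):

```python
def reporting_gap_pp(self_reported: float, third_party: float) -> float:
    """Positive gap = developer reports a higher score than the independent evaluation."""
    return round(self_reported - third_party, 1)

# The matched pair above:
gap = reporting_gap_pp(83.7, 83.5)
print(f"{gap:+.1f} pp")  # +0.2 pp
```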

The dominant loss is structural: most benchmarks are only measured by one source.

Gray background = total pairs. Green = verified pairs that passed model, benchmark, and methodology checks.

Data Sources

  • Artificial Analysis (5,013 scores): third-party standardized API evaluations; independent, reproducible testing environment.
  • Epoch AI (44 benchmarks · 306 scores): mixed-source benchmark database; 149 external evaluations + 157 developer-reported scores.
  • LLM Stats (1,490 scores): aggregates self-reported scores from developer papers and announcements.

Model Matching

Models appear under different names across sources (e.g., "GPT-5 (high)" in Artificial Analysis vs "gpt-5-high" in LLM Stats). We use a 3-phase matching pipeline:

  1. Organization resolution — LLM-assisted clustering of organization name variants ("Google DeepMind" = "Google" = "DeepMind")
  2. Per-org model grouping — Within each org, LLM groups source names into canonical model identities, assigning confidence scores and hierarchy levels (organization → family → variant → configuration)
  3. Consensus verification — Low-confidence matches re-evaluated with 3-pass LLM consensus at varying temperatures [1.0, 1.2, 1.4]
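Step 3 amounts to a majority vote over repeated LLM judgments at the three temperatures. A sketch under stated assumptions: `judge_match` is a stand-in for the actual LLM call (the real verifier and prompt are not shown), and boolean majority voting is our reading of "consensus":

```python
from collections import Counter

TEMPERATURES = [1.0, 1.2, 1.4]  # one pass per temperature

def consensus_match(name_a: str, name_b: str, judge_match) -> bool:
    """Re-evaluate a low-confidence match with a 3-pass LLM consensus.

    judge_match(a, b, temperature) is a placeholder returning True/False
    for "these two source names refer to the same model".
    """
    votes = [judge_match(name_a, name_b, temperature=t) for t in TEMPERATURES]
    return Counter(votes).most_common(1)[0][0]  # majority of the 3 passes
```

With three passes a simple majority (2 of 3) decides, and a boolean vote over an odd number of passes cannot tie.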

Comparability Assessment

For each (model, benchmark) pair with both self-reported and third-party scores, three checks determine quality tier:

  • Model match — Are we comparing the same model variant? (deterministic, based on Phase 2 confidence)
  • Benchmark match — Same benchmark version, subset, and year? (deterministic, based on Phase 3 confidence)
  • Methodology match — Same evaluation setup (prompting, scoring, tools)? (hybrid: deterministic when metadata available, LLM consensus when not)

Pairs are assigned Gold (all three match), Silver (uncertain on one), or Bronze (explicitly different methodology) tiers. Excluded pairs (different model or benchmark) are discarded.
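The tier assignment reduces to a small decision function. Here each check outcome is modeled as "match", "uncertain", or "different" (our labels for illustration; the pipeline's internal encoding may differ):

```python
def assign_tier(model: str, benchmark: str, methodology: str) -> str:
    """Map the three comparability checks to a quality tier.

    Each argument is one of "match", "uncertain", or "different".
    """
    if model == "different" or benchmark == "different":
        return "Excluded"  # not the same model or benchmark: discard
    checks = (model, benchmark, methodology)
    if all(c == "match" for c in checks):
        return "Gold"      # all three checks pass
    if methodology == "different":
        return "Bronze"    # explicitly different evaluation setup
    return "Silver"        # uncertain on at least one check
```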

Known Limitations

  • Coverage is sparse — only 15 benchmarks have sufficient cross-validated data
  • Cannot detect benchmaxing (training on benchmark data) or distillation
  • Single LLM verifier (Gemini 3 Flash) could introduce systematic bias
  • Snapshot in time — results reflect model landscape as of 2026-03-20
  • Selection bias toward models with both self-reported and third-party scores

Methodology Audit

We conducted a comprehensive audit identifying 27 methodological issues across severity levels. High-severity findings include variant selection bias toward reasoning models, heuristic score normalization, and arbitrary outlier thresholds. Full audit available in the research paper appendix.