Methodology
How We Built This
A 5-phase pipeline ingests data from 3 independent sources, matches models and benchmarks using LLM-assisted consensus, and produces verified comparable pairs for analysis. Most data is lost at a single bottleneck: cross-source overlap.
Each link's thickness is proportional to the number of scores flowing through it. The thin stream reaching "Matched Pairs" is the core bottleneck.
What Is a Matched Pair?
A matched pair exists when the same model is tested on the same benchmark by both a developer (self-reported) and an independent evaluator (third-party). Comparing the two scores reveals reporting bias.
Self-reported: MiMo-V2-Flash · Benchmark: GPQA Diamond · 83.7% · Source: Developer paper / LLM Stats
Third-party: MiMo-V2-Flash · Benchmark: GPQA Diamond · 83.5% · Source: Artificial Analysis
Here the self-reported score is 0.2 percentage points (pp) higher than the third-party score (83.7% vs 83.5%), a gap typical of the overall mean over-reporting bias we observe.
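The pairing rule can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the `Score` record, the `matched_pairs` helper, and the "Hypothetical Bench" entry are assumptions made up for the example. Only the MiMo-V2-Flash scores come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    value: float          # percentage, 0-100
    source: str
    self_reported: bool   # True for developer-reported, False for third-party


def matched_pairs(scores):
    """Return {(model, benchmark): gap in pp} for keys that have both kinds of score."""
    dev, third = {}, {}
    for s in scores:
        target = dev if s.self_reported else third
        target[(s.model, s.benchmark)] = s.value
    # Only keys present on both sides form a matched pair.
    shared = dev.keys() & third.keys()
    return {key: round(dev[key] - third[key], 2) for key in shared}


scores = [
    Score("MiMo-V2-Flash", "GPQA Diamond", 83.7, "LLM Stats", True),
    Score("MiMo-V2-Flash", "GPQA Diamond", 83.5, "Artificial Analysis", False),
    # Hypothetical row with no third-party counterpart: it yields no pair.
    Score("MiMo-V2-Flash", "Hypothetical Bench", 70.0, "LLM Stats", True),
]
pairs = matched_pairs(scores)
```

A positive gap means the developer reported a higher score than the independent evaluator measured. The third row illustrates the bottleneck: a score measured by only one source drops out.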
The dominant loss is structural: most benchmarks are only measured by one source.
Gray background = total pairs. Green = verified pairs that passed model, benchmark, and methodology checks.
Data Sources
Artificial Analysis
Third-party standardized API evaluations. Independent, reproducible testing environment.
5,031 scores
Epoch AI
Mixed-source benchmark database. 176 external evaluations + 177 developer-reported scores.
44 benchmarks · 353 scores
LLM Stats
Aggregates self-reported scores from developer papers and announcements.
2,315 scores
Model Matching
Models appear under different names across sources (e.g., "GPT-5 (high)" in Artificial Analysis vs "gpt-5-high" in LLM Stats). We use a 3-phase matching pipeline:
Phase 1: Organization Resolution
Source names for model creators vary widely — "Google DeepMind", "Google", "DeepMind", and "Alphabet" may all refer to the same organization. An LLM clusters all unique organization strings across our three data sources into canonical groups, producing a mapping table (e.g., all Google variants → "Google"). This narrows the search space so that model matching in Phase 2 only compares models within the same org.
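Once the mapping table exists, applying it is a plain lookup. The sketch below assumes the LLM's output has already been stored as a dictionary. `ORG_MAP`, `canonical_org`, and `bucket_by_org` are hypothetical names, and the lowercase model name is an illustrative variant rather than real data.

```python
# Hypothetical result of the LLM clustering step: raw org string -> canonical org.
ORG_MAP = {
    "Google DeepMind": "Google",
    "Google": "Google",
    "DeepMind": "Google",
    "Alphabet": "Google",
}


def canonical_org(raw, org_map=ORG_MAP):
    """Look up the canonical org; unmapped strings fall back to themselves."""
    cleaned = raw.strip()
    return org_map.get(cleaned, cleaned)


def bucket_by_org(records, org_map=ORG_MAP):
    """Group (org, model_name) records by canonical org, ready for Phase 2."""
    buckets = {}
    for org, model in records:
        buckets.setdefault(canonical_org(org, org_map), set()).add(model)
    return buckets


records = [
    ("Google DeepMind", "Gemini 2.5 Pro"),
    ("Alphabet", "gemini-2.5-pro"),       # illustrative name variant
    ("OpenAI", "GPT-5 (high)"),
]
buckets = bucket_by_org(records)
```

Each bucket then holds only the name variants that Phase 2 needs to compare against one another.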
Phase 2: Per-Org Model Grouping
Within each organization, the pipeline collects every model name variant from every source and asks an LLM to group them into canonical model identities. For each group, the LLM assigns:
- Canonical name — a clean display name (e.g., "Claude 3.5 Sonnet")
- Confidence score (0–1) — how certain the match is. Names like "claude-3.5-sonnet" and "Claude 3.5 Sonnet" get high confidence; ambiguous cases like "Claude 3.5 Sonnet v2" vs "Claude 3.5 Sonnet (Oct 2024)" get lower confidence
- Hierarchy level — organization → family → variant → configuration, so we can distinguish between, say, GPT-4o (family) and GPT-4o-mini (variant)
The LLM processes each organization in a single prompt, seeing all source names at once. This avoids pairwise comparisons (which would scale quadratically) and lets the model reason about the full set of names holistically.
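One possible shape for the Phase 2 output, and for the single per-organization prompt, is sketched below. The `ModelGroup` record, the `Level` enum, and the prompt wording are all assumptions for illustration; the actual prompt is not given here. `pairwise_comparisons` shows the quadratic cost that the single-prompt design avoids.

```python
from dataclasses import dataclass, field
from enum import Enum
from math import comb


class Level(Enum):
    ORGANIZATION = 1
    FAMILY = 2
    VARIANT = 3
    CONFIGURATION = 4


@dataclass
class ModelGroup:
    canonical_name: str                 # e.g. "Claude 3.5 Sonnet"
    confidence: float                   # 0-1, as assigned by the LLM
    level: Level
    source_names: list = field(default_factory=list)


def build_org_prompt(org, names):
    """Build one prompt per organization that lists every name variant once."""
    listing = "\n".join(f"- {n}" for n in sorted(set(names)))
    # Hypothetical wording, not the pipeline's real prompt.
    return (
        f"Organization: {org}\n"
        "Group the following model names into canonical model identities. "
        "For each group give a canonical name, a confidence in [0, 1], "
        "and a hierarchy level.\n" + listing
    )


def pairwise_comparisons(n):
    """Number of comparisons a pairwise approach would need for n names."""
    return comb(n, 2)


prompt = build_org_prompt(
    "Anthropic",
    ["claude-3.5-sonnet", "Claude 3.5 Sonnet", "claude-3.5-sonnet"],
)
```

For an organization with 40 name variants, a pairwise scheme would need `pairwise_comparisons(40)`, that is 780, comparisons, against one prompt here.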
Phase 3: Consensus Verification
Any match from Phase 2 with confidence below 0.85 is re-evaluated using a 3-pass LLM consensus protocol. The same matching prompt is run 3 times at increasing temperatures [1.0, 1.2, 1.4] to test robustness. If all three passes agree, the match is accepted; if they disagree, the match is either flagged for manual review or rejected. This catches borderline cases, such as distinguishing "Gemini 2.5 Pro" from "Gemini 2.5 Pro (Deep Research)", that might be conflated in a single pass.
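The consensus step can be sketched as follows. The 0.85 threshold and the three temperatures come from the description above; everything else is an assumption. In particular, `ask_llm` is a hypothetical callable standing in for whatever client the pipeline uses, and the sketch only flags disagreements, since the rule for choosing between manual review and rejection is not specified.

```python
TEMPERATURES = (1.0, 1.2, 1.4)
CONFIDENCE_THRESHOLD = 0.85


def verify_match(prompt, ask_llm, temperatures=TEMPERATURES):
    """Run the same prompt once per temperature; accept only if all answers agree.

    ask_llm(prompt, temperature) is a hypothetical callable that returns the
    canonical name the LLM assigns to the candidate.
    """
    answers = [ask_llm(prompt, t) for t in temperatures]
    if len(set(answers)) == 1:
        return ("accepted", answers[0])
    return ("flagged", None)


def resolve(candidate_name, confidence, prompt, ask_llm):
    """Skip verification for confident matches; otherwise run the consensus check."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("accepted", candidate_name)
    return verify_match(prompt, ask_llm)


# Stub LLMs for demonstration: one stable, one that changes its answer at 1.4.
def stable(prompt, temperature):
    return "Gemini 2.5 Pro"


def unstable(prompt, temperature):
    if temperature < 1.4:
        return "Gemini 2.5 Pro"
    return "Gemini 2.5 Pro (Deep Research)"
```

Because the stubs are deterministic, the sketch can be exercised without network access. A real client would be injected in place of `stable` or `unstable`.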
Benchmark Matching
The same 3-phase process is applied separately to benchmark names. Different sources use different naming conventions — "GPQA" vs "GPQA Diamond" vs "gpqa_diamond" — and may refer to different subsets or versions of the same benchmark. The LLM resolves these into canonical benchmark identities, flagging version mismatches (e.g., AIME 2024 vs AIME 2025) as non-comparable.
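A canonical benchmark identity could be modeled as a small record, with comparability reduced to an equality check. This is a sketch under assumptions: the `BenchmarkId` fields, the `ALIASES` table (standing in for the LLM's output), and the rule that all three fields must be equal are illustrative choices, not the pipeline's documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkId:
    family: str              # e.g. "GPQA", "AIME"
    subset: Optional[str]    # e.g. "Diamond"; None when the source names no subset
    version: Optional[str]   # e.g. "2024"; None when the source names no version


# Hypothetical alias table standing in for the LLM's resolution output.
ALIASES = {
    "GPQA": BenchmarkId("GPQA", None, None),
    "GPQA Diamond": BenchmarkId("GPQA", "Diamond", None),
    "gpqa_diamond": BenchmarkId("GPQA", "Diamond", None),
    "AIME 2024": BenchmarkId("AIME", None, "2024"),
    "AIME 2025": BenchmarkId("AIME", None, "2025"),
}


def comparable(a, b):
    """Assumed rule: two names are comparable only if family, subset, and version agree."""
    return a == b
```

Under this rule "GPQA Diamond" and "gpqa_diamond" resolve to the same identity, while "AIME 2024" and "AIME 2025" do not, and bare "GPQA" is kept apart from the Diamond subset.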
Comparability Assessment
Once models and benchmarks are matched, each (model, benchmark) pair with both self-reported and third-party scores undergoes three checks to determine its quality tier:
- Model match — Are we comparing the same model variant? (deterministic, based on Phase 2 confidence)
- Benchmark match — Same benchmark version, subset, and year? (deterministic, based on Phase 3 confidence)
- Methodology match — Same evaluation setup (prompting, scoring, tools)? (hybrid: deterministic when metadata available, LLM consensus when not)
Pairs are assigned Gold (all three match), Silver (uncertain on one), or Bronze (explicitly different methodology) tiers. Excluded pairs (different model or benchmark) are discarded. Outlier pairs — where the gap exceeds 3 standard deviations from the benchmark mean — are flagged separately to prevent data-entry errors from skewing results.
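The tiering and outlier rules can be sketched as below. Several details are assumptions because the text does not fix them: each check is encoded as "match", "uncertain", or "mismatch"; pairs with more than one uncertain check are returned as "Unclassified"; and the outlier test uses the sample standard deviation of the gaps within one benchmark.

```python
from statistics import mean, stdev


def assign_tier(model, benchmark, methodology):
    """Each argument is "match", "uncertain", or "mismatch" (assumed encoding)."""
    if model == "mismatch" or benchmark == "mismatch":
        return "Excluded"
    if methodology == "mismatch":
        return "Bronze"
    uncertain = [model, benchmark, methodology].count("uncertain")
    if uncertain == 0:
        return "Gold"
    if uncertain == 1:
        return "Silver"
    # The case of two or more uncertain checks is not described; assumption.
    return "Unclassified"


def flag_outliers(gaps, k=3.0):
    """Return indices of gaps more than k sample standard deviations from the mean."""
    if len(gaps) < 2:
        return []
    mu, sigma = mean(gaps), stdev(gaps)
    if sigma == 0:
        return []
    return [i for i, g in enumerate(gaps) if abs(g - mu) > k * sigma]


# Synthetic gaps for one benchmark: twenty exact agreements and one large gap.
gaps = [0.0] * 20 + [10.0]
outliers = flag_outliers(gaps)
```

With a sample standard deviation, no point can lie more than (n − 1)/√n deviations from the mean, so a 3-standard-deviation rule cannot flag anything on a benchmark with fewer than 11 pairs. This is worth keeping in mind given how sparse coverage is.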
Known Limitations
- Coverage is sparse — only 15 benchmarks have sufficient cross-validated data
- Cannot detect benchmaxing (training on benchmark data) or distillation
- Single LLM verifier (Gemini 3 Flash) could introduce systematic bias
- Snapshot in time — results reflect model landscape as of 2026-03-21
- Selection bias toward models with both self-reported and third-party scores
Methodology Audit
We conducted a comprehensive audit that identified 27 methodological issues across severity levels. High-severity findings include variant selection bias toward reasoning models, heuristic score normalization, and arbitrary outlier thresholds. The full audit is available in the research paper appendix.
Read the full paper
"Close, But How Close?" includes the complete methodology appendix, statistical analysis, and all 27 audit findings with severity classifications.