Methodology — Close, But How Close?

A 5-phase pipeline ingests data from 3 independent sources, matches models and benchmarks using LLM-assisted consensus, and produces verified comparable pairs for analysis. Most data is lost at a single bottleneck: cross-source overlap.

Each link's thickness is proportional to the number of scores flowing through it. The thin stream reaching "Matched Pairs" is the core bottleneck.

The dominant loss is structural: most benchmarks are only measured by one source.

Gray background = total pairs. Green = verified pairs that passed model, benchmark, and methodology checks.

Data Sources

Artificial Analysis

Third-party standardized API evaluations. Independent, reproducible testing environment.

5,031 scores

Epoch AI

Mixed-source benchmark database. 176 external evaluations + 177 developer-reported scores.

44 benchmarks · 353 scores

LLM Stats

Aggregates self-reported scores from developer papers and announcements.

2,315 scores

Model Matching

Models appear under different names across sources (e.g., "GPT-5 (high)" in Artificial Analysis vs "gpt-5-high" in LLM Stats). We use a 3-phase matching pipeline:

Phase 1: Organization Resolution

Source names for model creators vary widely — "Google DeepMind", "Google", "DeepMind", and "Alphabet" may all refer to the same organization. An LLM clusters all unique organization strings across our three data sources into canonical groups, producing a mapping table (e.g., all Google variants → "Google"). This narrows the search space so that model matching in Phase 2 only compares models within the same org.
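To make the role of the mapping table concrete, here is a minimal sketch of how it might be applied once the LLM has produced it. The Google entries come from the example above; the function name, the record layout, and the fallback for unmapped strings are assumptions for illustration, not the pipeline's actual code.

```python
from collections import defaultdict

# Hypothetical Phase 1 output: raw organization string -> canonical organization.
ORG_MAP = {
    "Google DeepMind": "Google",
    "Google": "Google",
    "DeepMind": "Google",
    "Alphabet": "Google",
}


def bucket_by_org(records, org_map):
    """Group (raw_org, model_name) records under their canonical organization."""
    buckets = defaultdict(set)
    for raw_org, model_name in records:
        key = raw_org.strip()
        # Assumption: unmapped strings fall back to themselves so nothing is dropped.
        canonical = org_map.get(key, key)
        buckets[canonical].add(model_name)
    return dict(buckets)


# Illustrative records; the model names are examples, not pipeline data.
records = [
    ("Google DeepMind", "Gemini 2.5 Pro"),
    ("Alphabet", "gemini-2.5-pro"),
    ("DeepMind", "Gemini 2.5 Pro"),
]
buckets = bucket_by_org(records, ORG_MAP)
```

All three records land in a single "Google" bucket, so Phase 2 only has to compare name variants inside that bucket.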

Phase 2: Per-Org Model Grouping

Within each organization, the pipeline collects every model name variant from every source and asks an LLM to group them into canonical model identities. For each group, the LLM assigns:

Canonical name — a clean display name (e.g., "Claude 3.5 Sonnet")
Confidence score (0–1) — how certain the match is. Names like "claude-3.5-sonnet" and "Claude 3.5 Sonnet" get high confidence; ambiguous cases like "Claude 3.5 Sonnet v2" vs "Claude 3.5 Sonnet (Oct 2024)" get lower confidence
Hierarchy level — organization → family → variant → configuration, so we can distinguish between, say, GPT-4o (family) and GPT-4o-mini (variant)

The LLM processes each organization in a single prompt, seeing all source names at once. This avoids pairwise comparisons (which would scale quadratically) and lets the model reason about the full set of names holistically.
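The three fields assigned to each group, and the quadratic cost that the single-prompt design avoids, can be sketched as follows. The class name, field names, and example values (including the 0.97 confidence and the "variant" level) are hypothetical; only the field meanings and the four hierarchy levels come from the description above.

```python
from dataclasses import dataclass, field
from math import comb

# Hierarchy levels as listed in the text, from coarsest to finest.
HIERARCHY_LEVELS = ("organization", "family", "variant", "configuration")


@dataclass
class ModelGroup:
    """One canonical model identity and the source names grouped under it."""
    canonical_name: str
    confidence: float
    hierarchy_level: str
    source_names: list = field(default_factory=list)

    def __post_init__(self):
        # Reject values outside the ranges the text describes.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.hierarchy_level not in HIERARCHY_LEVELS:
            raise ValueError(f"unknown hierarchy level: {self.hierarchy_level}")


def pairwise_comparisons(n_names: int) -> int:
    """Number of LLM calls a naive pairwise approach would need for n names."""
    return comb(n_names, 2)


# Illustrative values only.
group = ModelGroup(
    canonical_name="Claude 3.5 Sonnet",
    confidence=0.97,
    hierarchy_level="variant",
    source_names=["claude-3.5-sonnet", "Claude 3.5 Sonnet"],
)
```

For an organization with 40 name variants, `pairwise_comparisons(40)` is 780, compared with one prompt in the per-organization design.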

Phase 3: Consensus Verification

Any match from Phase 2 with confidence below 0.85 is re-evaluated using a 3-pass LLM consensus protocol. The same matching prompt is run 3 times at increasing temperatures [1.0, 1.2, 1.4] to test robustness — if all three passes agree, the match is accepted; if they disagree, the match is flagged for manual review or rejected. This catches borderline cases like distinguishing "Gemini 2.5 Pro" from "Gemini 2.5 Pro (Deep Research)" that might be conflated in a single pass.
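The control flow of this step might look like the sketch below. The 0.85 threshold and the three temperatures come from the text; the function names, the boolean verdict per pass, and the stubbed LLM call are assumptions. The text does not say how a unanimous "not the same model" verdict is handled, so this sketch accepts only a unanimous "yes" and flags everything else.

```python
CONFIDENCE_THRESHOLD = 0.85
TEMPERATURES = (1.0, 1.2, 1.4)


def verify_match(candidate, confidence, ask_llm):
    """Return "accepted" or "flagged" for one Phase 2 match.

    ask_llm is a hypothetical callable (candidate, temperature=...) -> bool
    that answers "do these names refer to the same model?".
    """
    # High-confidence matches skip re-evaluation entirely.
    if confidence >= CONFIDENCE_THRESHOLD:
        return "accepted"
    # Re-run the same question once per temperature.
    verdicts = [ask_llm(candidate, temperature=t) for t in TEMPERATURES]
    # Assumption: accept only if every pass confirms the match.
    if all(verdicts):
        return "accepted"
    return "flagged"


def unstable_llm(candidate, temperature):
    """Stub standing in for a real LLM call; flips its answer at high temperature."""
    return temperature < 1.3


pair = ("Gemini 2.5 Pro", "Gemini 2.5 Pro (Deep Research)")
result = verify_match(pair, confidence=0.6, ask_llm=unstable_llm)
```

Because the stub changes its answer at the highest temperature, the borderline pair ends up flagged rather than accepted.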

Benchmark Matching

The same 3-phase process is applied separately to benchmark names. Different sources use different naming conventions — "GPQA" vs "GPQA Diamond" vs "gpqa_diamond" — and may refer to different subsets or versions of the same benchmark. The LLM resolves these into canonical benchmark identities, flagging version mismatches (e.g., AIME 2024 vs AIME 2025) as non-comparable.

Comparability Assessment

Once models and benchmarks are matched, each (model, benchmark) pair with both self-reported and third-party scores undergoes three checks to determine its quality tier:

Model match — Are we comparing the same model variant? (deterministic, based on Phase 2 confidence)
Benchmark match — Same benchmark version, subset, and year? (deterministic, based on Phase 3 confidence)
Methodology match — Same evaluation setup (prompting, scoring, tools)? (hybrid: deterministic when metadata available, LLM consensus when not)

Pairs are assigned Gold (all three checks match), Silver (uncertain on one check), or Bronze (explicitly different methodology) tiers. Excluded pairs (different model or benchmark) are discarded. Outlier pairs, where the gap between self-reported and third-party scores lies more than 3 standard deviations from the benchmark's mean gap, are flagged separately so that data-entry errors do not skew results.
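The tiering and outlier rules can be sketched as below. The three-valued check results ("match", "uncertain", "different"), the function names, and the use of the sample standard deviation are assumptions. The text does not say what happens when more than one check is uncertain, so this sketch places those pairs in Silver as well.

```python
from statistics import mean, stdev


def assign_tier(model, benchmark, methodology):
    """Map three check results to a tier; each is "match", "uncertain", or "different"."""
    if model == "different" or benchmark == "different":
        return "Excluded"
    if methodology == "different":
        return "Bronze"
    if (model, benchmark, methodology) == ("match", "match", "match"):
        return "Gold"
    # Assumption: any remaining uncertainty lands in Silver.
    return "Silver"


def flag_outliers(gaps, k=3.0):
    """Flag gaps lying more than k standard deviations from the mean gap.

    gaps holds one benchmark's score gaps (assumed here to be self-reported
    minus third-party).
    """
    if len(gaps) < 2:
        return [False] * len(gaps)
    mu, sigma = mean(gaps), stdev(gaps)
    if sigma == 0:
        return [False] * len(gaps)
    return [abs(g - mu) > k * sigma for g in gaps]


# Illustrative gaps: twenty ordinary pairs and one suspicious entry.
gaps = [1.0] * 20 + [50.0]
flags = flag_outliers(gaps)
```

One design consequence worth noting: with the sample standard deviation, a single extreme value inflates sigma, so a benchmark with ten or fewer pairs can never produce a flag at the 3-sigma level.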

Known Limitations

Coverage is sparse — only 15 benchmarks have sufficient cross-validated data
Cannot detect benchmaxing (training on benchmark data) or distillation
Single LLM verifier (Gemini 3 Flash) could introduce systematic bias
Snapshot in time — results reflect model landscape as of 2026-03-21
Selection bias toward models with both self-reported and third-party scores

Methodology Audit

We conducted a comprehensive audit that identified 27 methodological issues across severity levels. High-severity findings include variant selection bias toward reasoning models, heuristic score normalization, and arbitrary outlier thresholds. The full audit is available in the research paper appendix.

Read the full paper

"Close, But How Close?" includes the complete methodology appendix, statistical analysis, and all 27 audit findings with severity classifications.
