This leaderboard compiles benchmark scores for 20+ major AI models across the most important evaluation datasets. Gemini 2.5 Ultra and GPT-5 lead across most benchmarks; DeepSeek V3 and Qwen 2.5 lead the open-weight category. Below you'll find the scores alongside plain-English explanations of what each benchmark actually measures.
Understanding the Benchmarks
Before the numbers, a quick guide to what each benchmark actually measures — and why the scores don't tell the whole story.
MMLU (Massive Multitask Language Understanding)
57 academic subjects from high-school to expert level. Questions span humanities, STEM, law, medicine, and more. A broad proxy for general knowledge breadth. Weakness: multiple-choice format doesn't test generation quality; some questions have been found in training data (contamination).
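MMLU is scored as plain multiple-choice accuracy, but there's a subtlety worth knowing when comparing reported numbers: averaging over all questions (micro) weights large subjects like professional law more heavily than averaging per subject (macro). A minimal sketch with hypothetical per-subject results (the subject names and counts below are illustrative, not real model data):

```python
# Hypothetical per-subject results: (subject, correct, total)
results = [
    ("high_school_physics", 41, 50),
    ("professional_law", 120, 200),
    ("anatomy", 38, 50),
]

# Micro-average: pool all questions, so big subjects dominate.
micro = sum(c for _, c, _ in results) / sum(t for _, _, t in results)
# Macro-average: compute accuracy per subject, then average the accuracies.
macro = sum(c / t for _, c, t in results) / len(results)

print(f"micro: {micro:.1%}, macro: {macro:.1%}")  # micro: 66.3%, macro: 72.7%
```

The two can differ by several points for the same model, so check which convention a leaderboard uses before comparing scores across sources.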
HumanEval
164 Python programming problems ranging from easy to hard. Measures functional code generation — whether the code actually works, not just whether it looks right. One of the more reliable benchmarks for practical coding capability.
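HumanEval results are conventionally reported as pass@k: the probability that at least one of k generated samples passes the unit tests. The standard unbiased estimator (from the original Codex paper) is simple to implement:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: samples that pass the unit tests
    k: evaluation budget (e.g. k=1 for pass@1)
    """
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# If 3 of 10 samples pass, pass@1 is simply the pass fraction.
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```

A model's HumanEval score is this value averaged over all 164 problems; the leaderboard figures below are pass@1.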
MATH
12,500 competition mathematics problems across 7 subjects. Tests multi-step mathematical reasoning, not just arithmetic. Strong correlation with scientific and engineering reasoning capability. Very hard: most general-purpose models still score well below 90%, while dedicated reasoning models now exceed 95%.
GPQA (Graduate-Level Google-Proof QA)
448 expert-level multiple-choice questions in biology, chemistry, and physics, written to be "Google-proof": even skilled non-experts with unrestricted web access can't reliably answer them. One of the harder benchmarks to game, and contamination risk is correspondingly lower.
MT-Bench
A set of 80 multi-turn questions covering writing, roleplay, reasoning, math, and coding, with answers scored by GPT-4 acting as judge. Assesses conversational capability and instruction-following across multiple turns. Score is out of 10.
BBH (BIG-Bench Hard)
23 challenging tasks from the BIG-Bench suite, selected because earlier language models scored below the average human rater on them. Tests unusual reasoning patterns, algorithmic thinking, and multi-step logical inference.
The Leaderboard
| Model | Provider | MMLU | HumanEval | MATH | GPQA | MT-Bench |
|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | 92.1% | 95.3% | 91.4% | 75.8% | 9.4 |
| Gemini 2.5 Ultra | Google | 91.8% | 88.5% | 92.1% | 76.2% | 9.3 |
| Claude 4 Opus | Anthropic | 89.4% | 91.2% | 84.5% | 70.1% | 9.3 |
| o3 | OpenAI | 87.5% | 96.7% | 97.1% | 87.7% | — |
| Claude 3.5 Sonnet | Anthropic | 88.7% | 93.7% | 73.4% | 59.4% | 9.2 |
| GPT-4o | OpenAI | 87.2% | 90.2% | 76.6% | 53.6% | 9.0 |
| Gemini 2.0 Pro | Google | 86.4% | 84.1% | 89.4% | 72.5% | 9.1 |
| DeepSeek V3 | DeepSeek AI | 88.5% | 82.6% | 90.2% | 59.1% | 9.1 |
| Grok 3 | xAI | 85.0% | 81.4% | 79.5% | 52.4% | 8.9 |
| Qwen 2.5 72B | Alibaba | 86.1% | 86.6% | 83.1% | 49.2% | 8.9 |
| Llama 4 Maverick | Meta | 85.5% | 85.5% | 79.5% | 48.5% | 8.7 |
| Mistral Large 2 | Mistral AI | 84.0% | 83.5% | 71.2% | 41.2% | 8.6 |
| Gemini 2.0 Flash | Google | 82.7% | 79.3% | 79.7% | 47.6% | 8.5 |
| GPT-4o mini | OpenAI | 82.0% | 87.2% | 70.2% | 40.2% | 8.4 |
| Claude 3.5 Haiku | Anthropic | 79.8% | 84.4% | 60.9% | 38.1% | 8.3 |
| Llama 4 Scout | Meta | 79.4% | 79.8% | 72.3% | 40.1% | 8.2 |
| DeepSeek R1 | DeepSeek AI | 84.5% | 89.4% | 97.3% | 71.5% | — |
| Mistral Small | Mistral AI | 72.2% | 68.6% | 58.1% | 30.4% | 7.8 |
| Llama 3.3 70B | Meta | 80.0% | 77.5% | 67.2% | 39.8% | 8.0 |
| Gemini 2.0 Flash Lite | Google | 77.1% | 71.2% | 68.4% | 36.5% | 7.8 |
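Because the benchmarks use different scales (percentages vs. MT-Bench's 0–10), averaging raw scores across columns is misleading. One illustrative way to build an overall ranking is to min-max normalize each benchmark column before averaging; a sketch using a few rows from the table above (MT-Bench rescaled to 0–100, and the aggregation method is our choice, not an official metric):

```python
# A few rows from the leaderboard: MMLU, HumanEval, MATH, GPQA, MT-Bench*10
scores = {
    "GPT-5":         [92.1, 95.3, 91.4, 75.8, 94.0],
    "Claude 4 Opus": [89.4, 91.2, 84.5, 70.1, 93.0],
    "DeepSeek V3":   [88.5, 82.6, 90.2, 59.1, 91.0],
    "Mistral Small": [72.2, 68.6, 58.1, 30.4, 78.0],
}

# Per-column min and max for normalization.
cols = list(zip(*scores.values()))
lo = [min(c) for c in cols]
hi = [max(c) for c in cols]

def normalized_mean(row):
    # Map each benchmark score to [0, 1] within its column, then average.
    return sum((v - l) / (h - l) for v, l, h in zip(row, lo, hi)) / len(row)

ranking = sorted(scores, key=lambda m: normalized_mean(scores[m]), reverse=True)
print(ranking)
```

Min-max normalization is sensitive to which models you include (the min and max anchor the scale), so treat any such composite score as a rough summary rather than a definitive ordering.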
Key Insights from the Leaderboard
Reasoning Models Are in a Different Category
o3 and DeepSeek R1 are reasoning models — they use extended chain-of-thought computation to solve problems. Their MATH scores (97.1% and 97.3%) are dramatically higher than standard models because they're purpose-built for this. The same applies to GPQA. For math and science, reasoning models aren't just better — they're in a different league.
The Open-Weight Frontier Has Arrived
DeepSeek V3's 88.5% MMLU score outperforms GPT-4o's 87.2%. Qwen 2.5 72B and Llama 4 Maverick are within 2–3 points of GPT-4o across most benchmarks. The capability gap between open and closed models at the frontier has effectively closed on standardized tests.
Benchmarks Don't Predict Everything
Claude 3.5 Sonnet scores lower than GPT-4o on MATH (73.4% vs 76.6%) but consistently outperforms GPT-4o on real-world writing, coding, and instruction-following tasks in our tests. Benchmark scores are correlated with capability but don't capture everything that matters for practical use.
Frequently Asked Questions
Which AI model scores highest on benchmarks?
GPT-5 and Gemini 2.5 Ultra lead on most general benchmarks. For math specifically, o3 and DeepSeek R1 are at the top with ~97% MATH scores. For coding, o3 and GPT-5 lead with 95–97% HumanEval.
Are benchmark scores reliable indicators of real-world performance?
They're useful signals but not perfect predictors. Benchmark contamination (training data overlap with test sets) inflates some scores. Real-world performance depends heavily on task type, prompt quality, and what specifically you're measuring. Always test models on your actual use case.
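One common (if coarse) way labs screen for the contamination mentioned above is n-gram overlap between benchmark questions and the training corpus; GPT-3's evaluation, for example, used 13-gram matching. A minimal sketch of the idea, with hypothetical document inputs:

```python
def ngrams(text: str, n: int = 13) -> set:
    """All word-level n-grams in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_docs, n: int = 13) -> bool:
    """Flag a benchmark question if it shares any n-gram with the corpus."""
    q = ngrams(question, n)
    return any(q & ngrams(doc, n) for doc in corpus_docs)
```

In practice contamination checks are more elaborate (fuzzy matching, paraphrase detection), but exact n-gram overlap is the baseline most published "decontaminated" scores refer to.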
How often is this leaderboard updated?
We update this article within two weeks of any major model release. Check the "Last Updated" date at the top for the current version.
What benchmark best predicts coding ability?
HumanEval is the most commonly used coding benchmark, but MBPP (Mostly Basic Python Programming) is more representative of practical code tasks. For real code quality assessment, neither benchmark substitutes for testing on your actual codebase.