
LLM Benchmark Leaderboard 2025: MMLU, HumanEval, MATH, and More

A comprehensive, regularly updated benchmark table for 20+ major AI models across MMLU, HumanEval, MATH, MT-Bench, and GPQA — with plain-English explanations of what each score actually means.

Travis Johnson

Founder, Deepest

July 19, 2025 · 14 min read

This leaderboard compiles benchmark scores for 20+ major AI models across the most important evaluation datasets. Gemini 2.5 Ultra and GPT-5 lead across most benchmarks; DeepSeek V3 and Qwen 2.5 lead the open-weight category. Below you'll find the scores alongside plain-English explanations of what each benchmark actually measures.

Last Updated: April 2026. Scores reflect the latest available results from official model cards, technical reports, and independent evaluations. Some scores are from provider-reported results; others from independent reproductions. See linked sources for methodology.

Understanding the Benchmarks

Before the numbers, a quick guide to what each benchmark actually measures — and why the scores don't tell the whole story.

MMLU (Massive Multitask Language Understanding)

57 academic subjects spanning elementary to advanced professional level. Questions cover humanities, STEM, law, medicine, and more, making it a broad proxy for general knowledge breadth. Weakness: the multiple-choice format doesn't test generation quality, and some questions have been found in training data (contamination).
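Mechanically, MMLU is nothing more than accuracy over four-option multiple choice. A minimal scoring sketch (the two items below are invented placeholders, not real MMLU questions):

```python
# Minimal sketch of MMLU-style scoring: accuracy over 4-option multiple choice.
# The items below are invented placeholders, not real MMLU questions.
items = [
    {"question": "Which planet is largest?",
     "choices": ["A) Mars", "B) Jupiter", "C) Venus", "D) Mercury"],
     "answer": "B"},
    {"question": "H2O is commonly called?",
     "choices": ["A) salt", "B) ammonia", "C) water", "D) methane"],
     "answer": "C"},
]

def grade(model_answers, items):
    """model_answers: one letter ('A'..'D') per item; returns accuracy."""
    correct = sum(pred == item["answer"]
                  for pred, item in zip(model_answers, items))
    return correct / len(items)

print(grade(["B", "C"], items))  # 1.0 -- both answered correctly
```

The real evaluation differs only in scale (about 14,000 test questions) and in how the model's letter choice is extracted from its output, which is itself a source of score variance between harnesses.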

HumanEval

164 Python programming problems ranging from easy to hard. Measures functional code generation — whether the code actually works, not just whether it looks right. One of the more reliable benchmarks for practical coding capability.
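HumanEval scores are usually reported as pass@k: the probability that at least one of k generated samples passes the problem's unit tests. The original HumanEval paper (Chen et al., 2021) gives an unbiased estimator for this, computed from n samples of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples that pass the tests,
    k = sampling budget. Estimates P(at least one of k samples passes)."""
    if n - c < k:
        # Fewer failures than the budget: some passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 50 pass the unit tests.
print(round(pass_at_k(200, 50, 1), 3))  # 0.25 -- pass@1 reduces to c/n
```

Note that pass@1 reduces to the simple pass rate c/n; the estimator matters for k > 1, where naively subsampling would give a biased answer.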

MATH

12,500 competition mathematics problems across 7 subjects. Tests multi-step mathematical reasoning, not just arithmetic, and correlates strongly with scientific and engineering reasoning capability. Very hard: standard models remain below 95%, and only dedicated reasoning models exceed it.

GPQA (Graduate-Level Google-Proof QA)

448 expert-level questions in biology, chemistry, and physics, written so that even skilled non-experts with unrestricted web access cannot reliably answer them. One of the harder benchmarks to game; because the questions are recent and not widely circulated, contamination risk is also lower here.

MT-Bench

An LLM-as-judge benchmark: GPT-4 scores model responses to 80 multi-turn questions covering writing, roleplay, reasoning, math, and coding, on a 1–10 scale. Assesses conversational quality and instruction-following across multiple turns.
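Under the hood, the judge model is prompted to emit its rating in a fixed format that the harness then parses; the widely used FastChat MT-Bench harness expects `Rating: [[X]]`. A hedged sketch of just the parsing step (treat the exact tag convention as illustrative):

```python
import re

def parse_judge_score(judgment: str):
    """Extract a numeric rating from a judge response that follows the
    'Rating: [[X]]' convention used by MT-Bench-style judging harnesses.
    Returns None if no rating is found (such responses are typically retried)."""
    m = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", judgment)
    return float(m.group(1)) if m else None

print(parse_judge_score("The reply is fluent and correct. Rating: [[8]]"))  # 8.0
```

This is worth knowing because it's a real source of noise: if the judge deviates from the format, or the parser is lenient in different ways across harnesses, reported MT-Bench scores can differ slightly for the same model.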

BBH (BIG-Bench Hard)

23 challenging tasks from the BIG-Bench suite, selected because earlier language models scored below the average human rater on them. Tests unusual reasoning patterns, algorithmic thinking, and multi-step logical inference.

The Leaderboard

| Model | Provider | MMLU | HumanEval | MATH | GPQA | MT-Bench |
|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | 92.1% | 95.3% | 91.4% | 75.8% | 9.4 |
| Gemini 2.5 Ultra | Google | 91.8% | 88.5% | 92.1% | 76.2% | 9.3 |
| Claude 4 Opus | Anthropic | 89.4% | 91.2% | 84.5% | 70.1% | 9.3 |
| o3 | OpenAI | 87.5% | 96.7% | 97.1% | 87.7% | N/A |
| Claude 3.5 Sonnet | Anthropic | 88.7% | 93.7% | 73.4% | 59.4% | 9.2 |
| GPT-4o | OpenAI | 87.2% | 90.2% | 76.6% | 53.6% | 9.0 |
| Gemini 2.0 Pro | Google | 86.4% | 84.1% | 89.4% | 72.5% | 9.1 |
| DeepSeek V3 | DeepSeek AI | 88.5% | 82.6% | 90.2% | 59.1% | 9.1 |
| Grok 3 | xAI | 85.0% | 81.4% | 79.5% | 52.4% | 8.9 |
| Qwen 2.5 72B | Alibaba | 86.1% | 86.6% | 83.1% | 49.2% | 8.9 |
| Llama 4 Maverick | Meta | 85.5% | 85.5% | 79.5% | 48.5% | 8.7 |
| Mistral Large 2 | Mistral AI | 84.0% | 83.5% | 71.2% | 41.2% | 8.6 |
| Gemini 2.0 Flash | Google | 82.7% | 79.3% | 79.7% | 47.6% | 8.5 |
| GPT-4o mini | OpenAI | 82.0% | 87.2% | 70.2% | 40.2% | 8.4 |
| Claude 3.5 Haiku | Anthropic | 79.8% | 84.4% | 60.9% | 38.1% | 8.3 |
| Llama 4 Scout | Meta | 79.4% | 79.8% | 72.3% | 40.1% | 8.2 |
| DeepSeek R1 | DeepSeek AI | 84.5% | 89.4% | 97.3% | 71.5% | N/A |
| Mistral Small | Mistral AI | 72.2% | 68.6% | 58.1% | 30.4% | 7.8 |
| Llama 3.3 70B | Meta | 80.0% | 77.5% | 67.2% | 39.8% | 8.0 |
| Gemini 2.0 Flash Lite | Google | 77.1% | 71.2% | 68.4% | 36.5% | 7.8 |

N/A: no MT-Bench score reported for these reasoning models.

Key Insights from the Leaderboard

Reasoning Models Are in a Different Category

o3 and DeepSeek R1 are reasoning models — they use extended chain-of-thought computation to solve problems. Their MATH scores (97.1% and 97.3%) are dramatically higher than standard models because they're purpose-built for this. The same applies to GPQA. For math and science, reasoning models aren't just better — they're in a different league.

The Open-Weight Frontier Has Arrived

DeepSeek V3's 88.5% MMLU score outperforms GPT-4o's 87.2%. Qwen 2.5 72B and Llama 4 Maverick are within 2–3 points of GPT-4o across most benchmarks. The capability gap between open and closed models at the frontier has effectively closed on standardized tests.

Benchmarks Don't Predict Everything

Claude 3.5 Sonnet scores lower than GPT-4o on MATH (73.4% vs 76.6%) but consistently outperforms GPT-4o on real-world writing, coding, and instruction-following tasks in our tests. Benchmark scores are correlated with capability but don't capture everything that matters for practical use.

Frequently Asked Questions

Which AI model scores highest on benchmarks?

GPT-5 and Gemini 2.5 Ultra lead on most general benchmarks. For math specifically, o3 and DeepSeek R1 are at the top with ~97% MATH scores. For coding, o3 and GPT-5 lead with 95–97% HumanEval.

Are benchmark scores reliable indicators of real-world performance?

They're useful signals but not perfect predictors. Benchmark contamination (training data overlap with test sets) inflates some scores. Real-world performance depends heavily on task type, prompt quality, and what specifically you're measuring. Always test models on your actual use case.

How often is this leaderboard updated?

We update this article within two weeks of any major model release. Check the "Last Updated" date at the top for the current version.

What benchmark best predicts coding ability?

HumanEval is the most commonly used coding benchmark, but MBPP (Mostly Basic Python Programming) is more representative of practical code tasks. For real code quality assessment, neither benchmark substitutes for testing on your actual codebase.

Tags: LLM benchmarks · MMLU · HumanEval · leaderboard · rankings

See it for yourself

Run any prompt across ChatGPT, Claude, Gemini, and 300+ other models simultaneously. Free to try, no credit card required.

Try Deepest free →
