This leaderboard compiles benchmark scores for 20+ major AI models across the most important evaluation datasets. Gemini 2.5 Ultra and GPT-5 lead across most benchmarks; DeepSeek V3 and Qwen 2.5 lead the open-weight category. Below you'll find the scores alongside plain-English explanations of what each benchmark actually measures.
Understanding the Benchmarks
Before the numbers, a quick guide to what each benchmark actually measures — and why the scores don't tell the whole story.
MMLU (Massive Multitask Language Understanding)
57 academic subjects from high-school to expert level. Questions span humanities, STEM, law, medicine, and more. A broad proxy for general knowledge breadth. Weakness: multiple-choice format doesn't test generation quality; some questions have been found in training data (contamination).
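MMLU is scored as plain multiple-choice accuracy, but there's a subtlety worth knowing when comparing reported numbers: averaging over all questions (micro) weights large subjects like professional law more heavily than averaging per subject (macro). A minimal sketch with hypothetical per-subject results (the subject names and counts below are illustrative, not real model data):

```python
# Hypothetical per-subject results: (subject, correct, total)
results = [
    ("high_school_physics", 41, 50),
    ("professional_law", 120, 200),
    ("anatomy", 38, 50),
]

# Micro-average: pool all questions, so big subjects dominate.
micro = sum(c for _, c, _ in results) / sum(t for _, _, t in results)
# Macro-average: compute accuracy per subject, then average the accuracies.
macro = sum(c / t for _, c, t in results) / len(results)

print(f"micro: {micro:.1%}, macro: {macro:.1%}")  # micro: 66.3%, macro: 72.7%
```

The two can differ by several points for the same model, so check which convention a leaderboard uses before comparing scores across sources.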
HumanEval
164 Python programming problems ranging from easy to hard. Measures functional code generation — whether the code actually works, not just whether it looks right. One of the more reliable benchmarks for practical coding capability.
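HumanEval results are conventionally reported as pass@k: the probability that at least one of k generated samples passes the unit tests. The standard unbiased estimator (from the original Codex paper) is simple to implement:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: samples that pass the unit tests
    k: evaluation budget (e.g. k=1 for pass@1)
    """
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# If 3 of 10 samples pass, pass@1 is simply the pass fraction.
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```

A model's HumanEval score is this value averaged over all 164 problems; the leaderboard figures below are pass@1.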
MATH
12,500 competition mathematics problems across 7 subjects. Tests multi-step mathematical reasoning, not just arithmetic. Strong correlation with scientific and engineering reasoning capability. Very hard: most general-purpose models still score well below 90%, while dedicated reasoning models now exceed 95%.
GPQA (Graduate-Level Google-Proof QA)
448 expert-level multiple-choice questions in biology, chemistry, and physics, written to be "Google-proof": even skilled non-experts with unrestricted web access can't reliably answer them. One of the harder benchmarks to game, and contamination risk is correspondingly lower.
MT-Bench
A set of 80 multi-turn questions covering writing, roleplay, reasoning, math, and coding, with answers scored by GPT-4 acting as judge. Assesses conversational capability and instruction-following across multiple turns. Score is out of 10.
BBH (BIG-Bench Hard)
23 challenging tasks from the BIG-Bench suite, selected because earlier language models scored below the average human rater on them. Tests unusual reasoning patterns, algorithmic thinking, and multi-step logical inference.
The Leaderboard
| Model | Provider | MMLU | HumanEval | MATH | GPQA | MT-Bench |
|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | 92.1% | 95.3% | 91.4% | 75.8% | 9.4 |
| Gemini 2.5 Ultra | Google | 91.8% | 88.5% | 92.1% | 76.2% | 9.3 |
| Claude 4 Opus | Anthropic | 89.4% | 91.2% | 84.5% | 70.1% | 9.3 |
| o3 | OpenAI | 87.5% | 96.7% | 97.1% | 87.7% | — |
| Claude 3.5 Sonnet | Anthropic | 88.7% | 93.7% | 73.4% | 59.4% | 9.2 |
| GPT-4o | OpenAI | 87.2% | 90.2% | 76.6% | 53.6% | 9.0 |
| Gemini 2.0 Pro | Google | 86.4% | 84.1% | 89.4% | 72.5% | 9.1 |
| DeepSeek V3 | DeepSeek AI | 88.5% | 82.6% | 90.2% | 59.1% | 9.1 |
| Grok 3 | xAI | 85.0% | 81.4% | 79.5% | 52.4% | 8.9 |
| Qwen 2.5 72B | Alibaba | 86.1% | 86.6% | 83.1% | 49.2% | 8.9 |
| Llama 4 Maverick | Meta | 85.5% | 85.5% | 79.5% | 48.5% | 8.7 |
| Mistral Large 2 | Mistral AI | 84.0% | 83.5% | 71.2% | 41.2% | 8.6 |
| Gemini 2.0 Flash | Google | 82.7% | 79.3% | 79.7% | 47.6% | 8.5 |
| GPT-4o mini | OpenAI | 82.0% | 87.2% | 70.2% | 40.2% | 8.4 |
| Claude 3.5 Haiku | Anthropic | 79.8% | 84.4% | 60.9% | 38.1% | 8.3 |
| Llama 4 Scout | Meta | 79.4% | 79.8% | 72.3% | 40.1% | 8.2 |
| DeepSeek R1 | DeepSeek AI | 84.5% | 89.4% | 97.3% | 71.5% | — |
| Mistral Small | Mistral AI | 72.2% | 68.6% | 58.1% | 30.4% | 7.8 |
| Llama 3.3 70B | Meta | 80.0% | 77.5% | 67.2% | 39.8% | 8.0 |
| Gemini 2.0 Flash Lite | Google | 77.1% | 71.2% | 68.4% | 36.5% | 7.8 |
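Because the benchmarks use different scales (percentages vs. MT-Bench's 0–10), averaging raw scores across columns is misleading. One illustrative way to build an overall ranking is to min-max normalize each benchmark column before averaging; a sketch using a few rows from the table above (MT-Bench rescaled to 0–100, and the aggregation method is our choice, not an official metric):

```python
# A few rows from the leaderboard: MMLU, HumanEval, MATH, GPQA, MT-Bench*10
scores = {
    "GPT-5":         [92.1, 95.3, 91.4, 75.8, 94.0],
    "Claude 4 Opus": [89.4, 91.2, 84.5, 70.1, 93.0],
    "DeepSeek V3":   [88.5, 82.6, 90.2, 59.1, 91.0],
    "Mistral Small": [72.2, 68.6, 58.1, 30.4, 78.0],
}

# Per-column min and max for normalization.
cols = list(zip(*scores.values()))
lo = [min(c) for c in cols]
hi = [max(c) for c in cols]

def normalized_mean(row):
    # Map each benchmark score to [0, 1] within its column, then average.
    return sum((v - l) / (h - l) for v, l, h in zip(row, lo, hi)) / len(row)

ranking = sorted(scores, key=lambda m: normalized_mean(scores[m]), reverse=True)
print(ranking)
```

Min-max normalization is sensitive to which models you include (the min and max anchor the scale), so treat any such composite score as a rough summary rather than a definitive ordering.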
Key Insights from the Leaderboard
Reasoning Models Are in a Different Category
o3 and DeepSeek R1 are reasoning models — they use extended chain-of-thought computation to solve problems. Their MATH scores (97.1% and 97.3%) are dramatically higher than standard models because they're purpose-built for this. The same applies to GPQA. For math and science, reasoning models aren't just better — they're in a different league.
The Open-Weight Frontier Has Arrived
DeepSeek V3's 88.5% MMLU score outperforms GPT-4o's 87.2%. Qwen 2.5 72B and Llama 4 Maverick are within 2–3 points of GPT-4o across most benchmarks. The capability gap between open and closed models at the frontier has effectively closed on standardized tests.
Benchmarks Don't Predict Everything
Claude 3.5 Sonnet scores lower than GPT-4o on MATH (73.4% vs 76.6%) but consistently outperforms GPT-4o on real-world writing, coding, and instruction-following tasks in our tests. Benchmark scores are correlated with capability but don't capture everything that matters for practical use.
Frequently Asked Questions
Which AI model scores highest on benchmarks?
GPT-5 and Gemini 2.5 Ultra lead on most general benchmarks. For math specifically, o3 and DeepSeek R1 are at the top with ~97% MATH scores. For coding, o3 and GPT-5 lead with 95–97% HumanEval.
Are benchmark scores reliable indicators of real-world performance?
They're useful signals but not perfect predictors. Benchmark contamination (training data overlap with test sets) inflates some scores. Real-world performance depends heavily on task type, prompt quality, and what specifically you're measuring. Always test models on your actual use case.
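One common (if coarse) way labs screen for the contamination mentioned above is n-gram overlap between benchmark questions and the training corpus; GPT-3's evaluation, for example, used 13-gram matching. A minimal sketch of the idea, with hypothetical document inputs:

```python
def ngrams(text: str, n: int = 13) -> set:
    """All word-level n-grams in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_docs, n: int = 13) -> bool:
    """Flag a benchmark question if it shares any n-gram with the corpus."""
    q = ngrams(question, n)
    return any(q & ngrams(doc, n) for doc in corpus_docs)
```

In practice contamination checks are more elaborate (fuzzy matching, paraphrase detection), but exact n-gram overlap is the baseline most published "decontaminated" scores refer to.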
How often is this leaderboard updated?
We update this article within two weeks of any major model release. Check the "Last Updated" date at the top for the current version.
What benchmark best predicts coding ability?
HumanEval is the most commonly used coding benchmark, but MBPP (Mostly Basic Python Programming) is more representative of practical code tasks. For real code quality assessment, neither benchmark substitutes for testing on your actual codebase.