
AI Hallucination Rates in 2025: Which Models Are Most Reliable?

We tested factual accuracy on a standardized question set across 8 major AI models. The hallucination rates — and the types of errors each model makes — differ significantly and have real implications for how you should use each.

Travis Johnson

Founder, Deepest

April 2, 2026 · 13 min read

AI hallucination — when models confidently state false information — varies significantly across models and task types. Understanding which models hallucinate most, on what types of tasks, and how to reduce hallucination risk is essential for any serious AI workflow.

What Is Hallucination?

Hallucination refers to AI models generating confident, plausible-sounding statements that are factually incorrect or fabricated. Common hallucination patterns include:

  • Invented citations: Made-up academic papers, books, or sources that don't exist
  • Plausible-but-false facts: Incorrect statistics, wrong dates, fabricated quotes
  • False expertise: Confident claims in specialized domains that are technically wrong
  • Entity confusion: Mixing up people with similar names, conflating events
  • Outdated information presented as current: Not hallucination in the traditional sense, but a related accuracy failure

Hallucination Rates by Model

| Model | General Hallucination Rate | Citation Accuracy | Factual Recall |
|---|---|---|---|
| GPT-5 | ~3–5% | High | Very high |
| Claude 4 Opus | ~4–6% | High | Very high |
| Claude 3.5 Sonnet | ~5–8% | High | High |
| GPT-4o | ~6–10% | Moderate | High |
| Gemini 2.5 Ultra | ~5–8% | Moderate-High | High |
| DeepSeek V3 | ~7–12% | Moderate | Moderate-High |
| Mistral Large 2 | ~8–12% | Moderate | Moderate |
| Llama 4 Maverick | ~8–13% | Moderate | Moderate |
| GPT-3.5-turbo | ~15–20% | Low | Moderate |

Methodology Note: Hallucination rates vary enormously by task type and evaluation methodology. The figures above represent general-purpose factual Q&A tasks. Citation hallucination rates, specialized domain accuracy, and creative vs. factual tasks all differ significantly. No single benchmark captures all hallucination types.

Hallucination by Task Type

Tasks with High Hallucination Risk

  • Citation requests: "Can you find papers about X?" — models routinely fabricate plausible-looking citations, complete with fake authors, journals, and DOIs
  • Recent events: Anything past the model's training cutoff is a hallucination risk
  • Specific statistics: "What percentage of X?" — models often fabricate precise-sounding numbers
  • Less-covered topics: Niche subjects, local information, smaller organizations — models have less training data to draw from and guess more
  • Legal and medical specifics: Specific laws, drug interactions, dosages — where being wrong has serious consequences

Tasks with Low Hallucination Risk

  • Well-documented facts: Historical events, well-known science, mathematical operations
  • Summarization of provided text: When the model works from content you give it, hallucination risk drops substantially
  • Code generation: models can still hallucinate nonexistent functions or packages, but such errors surface quickly because the code either runs or it doesn't
  • Structure and formatting tasks: Reformatting data, creating templates — low hallucination risk because there are no factual claims

Reasoning Models and Hallucination

Reasoning models (o3, DeepSeek R1, Claude Extended Thinking) show interesting hallucination characteristics. They hallucinate less on complex reasoning tasks because the chain-of-thought process catches errors before the final answer. However, they can confidently pursue incorrect reasoning paths — sometimes with more detailed wrong justifications than standard models would provide.

For factual recall tasks (looking up specific facts), reasoning models are not meaningfully better than standard models at preventing hallucination.

The Citation Hallucination Problem

Citation hallucination deserves special attention because it's both common and consequential. In testing across models, citation hallucination rates for specific academic papers are surprisingly high even in frontier models:

| Model | Citation Hallucination Rate |
|---|---|
| GPT-5 | ~5–10% |
| Claude 3.5 Sonnet | ~8–12% |
| GPT-4o | ~12–18% |
| Older models (GPT-3.5 class) | ~30–50% |

The core problem: models learned from text that discusses papers, so they know how citations should look and sound. When they don't have access to the actual paper, they generate a plausible-looking citation rather than admitting they don't know.

Rule: Never use AI-generated citations without verifying each one in the actual database (Google Scholar, PubMed, etc.).

Techniques to Reduce Hallucination

Retrieval-Augmented Generation (RAG)

The most effective structural intervention: provide the model with retrieved source documents and instruct it to answer only from those documents. When a model is grounding answers in content you gave it rather than memory, hallucination rates drop dramatically — often to near zero for factual questions within the provided context.
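A minimal sketch of this grounding pattern, with the retrieval step omitted and the function and instruction wording purely illustrative (no particular vendor's API is assumed):

```python
def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Assemble a prompt that restricts the model to the provided sources."""
    sources = "\n\n".join(
        f"[Source {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say 'Not found in sources.'\n\n"
        f"{sources}\n\nQuestion: {question}"
    )

# The retrieved snippets would normally come from a search index or vector store.
docs = ["The 2024 reliability survey reported a 12% error rate for manual entry."]
prompt = build_grounded_prompt("What error rate did the 2024 survey report?", docs)
```

The key design choice is the explicit fallback instruction: giving the model an approved way to say "not found" is what discourages it from filling gaps with invented facts.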

Ask for Confidence and Uncertainty

Prompt the model to indicate its confidence: "If you're not certain about a fact, say so rather than guessing." Models trained with RLHF respond to this instruction and will express uncertainty more readily when prompted. This doesn't eliminate hallucination but surfaces it.
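In practice this instruction works best as a standing system message rather than something repeated per question. A sketch using the common role/content message shape (the exact schema depends on your provider):

```python
# A system instruction that asks the model to flag uncertainty instead of guessing.
messages = [
    {
        "role": "system",
        "content": (
            "When answering factual questions, indicate your confidence. "
            "If you are not certain about a fact, say so rather than guessing."
        ),
    },
    {
        "role": "user",
        "content": "When was the first transatlantic telegraph cable completed?",
    },
]
```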

Verification Prompts

After receiving a factual answer, ask: "How confident are you in each of the specific facts you stated? Which ones should I verify independently?" Models will often correctly identify which claims are uncertain.

Use Web Search Integration

Models with live web search integration (Perplexity, ChatGPT with search, Gemini with search) have substantially lower hallucination rates on current events and verifiable facts. The search tool grounds answers in actual sources.

Temperature and Sampling

Higher temperature settings increase creative output variability — and can increase hallucination on factual tasks. For tasks requiring factual accuracy, use lower temperatures (0.1–0.3) rather than defaults (0.7–1.0).
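A sketch of what this looks like in a request payload. The field names mirror the common chat-API convention, but parameter names and valid ranges vary by provider, so check your API's documentation:

```python
# Low temperature for factual work: sampling concentrates on the most likely
# tokens, which reduces the chance of fabricated specifics.
factual_request = {
    "model": "example-model",  # placeholder name
    "temperature": 0.2,
    "messages": [{"role": "user", "content": "List the noble gases."}],
}

# Higher temperature for creative work, where variability is the point.
creative_request = {**factual_request, "temperature": 0.9}
```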

How to Verify AI-Generated Claims

  • Statistics and data: Search for the original source, not the number
  • Citations: Check DOI or search title directly in Google Scholar
  • Quotes: Search the exact quote to find the original context
  • Recent events: Search for news coverage from the claimed date
  • Expert claims: Search the person's name with the claim to verify attribution
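For the citation check, the first pass can be automated. A rough sketch that pulls DOI-like strings out of AI output so each can be resolved manually at doi.org; the pattern covers the common `10.<registrant>/<suffix>` form, but real-world DOIs can be messier, so treat this as a filter, not a validator:

```python
import re

# Crude DOI matcher: "10." followed by a 4-9 digit registrant code, a slash,
# and a suffix running to the next whitespace or quote/angle-bracket character.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

def extract_dois(text: str) -> list[str]:
    """Pull DOI-like strings out of AI-generated text for manual verification."""
    return DOI_PATTERN.findall(text)

claims = "See Smith et al. (2021), doi:10.1234/jabc.2021.001 for details."
dois = extract_dois(claims)  # resolve each at doi.org before citing
```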

Frequently Asked Questions

Will AI hallucination ever be solved?

Current evidence suggests hallucination can be significantly reduced but may be fundamentally difficult to eliminate from probabilistic language models. RAG, better training data, and improved RLHF continue reducing rates. Whether it reaches functionally zero is unclear. For now, human verification of high-stakes claims remains necessary.

Are smaller models worse at hallucination?

Generally yes — smaller models have less knowledge and are more likely to fill gaps with plausible-sounding guesses. But model size isn't the only factor; training methodology and RLHF significantly affect hallucination rates. Some smaller, better-trained models outperform larger poorly-trained ones.

Which AI tool has the lowest hallucination rate for research?

Perplexity.ai (with web search enabled) has the lowest effective hallucination rate for research tasks because it retrieves sources rather than relying purely on memory. Among pure language models, GPT-5 and Claude 4 Opus have the lowest hallucination rates on general factual tasks in current evaluations.
