AI hallucination — when models confidently state false information — varies significantly across models and task types. Understanding which models hallucinate most, on what types of tasks, and how to reduce hallucination risk is essential for any serious AI workflow.
What Is Hallucination?
Hallucination refers to AI models generating confident, plausible-sounding statements that are factually incorrect or fabricated. Common hallucination patterns include:
- Invented citations: Made-up academic papers, books, or sources that don't exist
- Plausible-but-false facts: Incorrect statistics, wrong dates, fabricated quotes
- False expertise: Confident claims in specialized domains that are technically wrong
- Entity confusion: Mixing up people with similar names, conflating events
- Outdated information presented as current: Not hallucination in the traditional sense, but a related accuracy failure
Hallucination Rates by Model
| Model | General Hallucination Rate | Citation Accuracy | Factual Recall |
|---|---|---|---|
| GPT-5 | ~3–5% | High | Very high |
| Claude 4 Opus | ~4–6% | High | Very high |
| Claude 3.5 Sonnet | ~5–8% | High | High |
| GPT-4o | ~6–10% | Moderate | High |
| Gemini 2.5 Ultra | ~5–8% | Moderate-High | High |
| DeepSeek V3 | ~7–12% | Moderate | Moderate-High |
| Mistral Large 2 | ~8–12% | Moderate | Moderate |
| Llama 4 Maverick | ~8–13% | Moderate | Moderate |
| GPT-3.5-turbo | ~15–20% | Low | Moderate |
Hallucination by Task Type
Tasks with High Hallucination Risk
- Citation requests: "Can you find papers about X?" — models routinely fabricate plausible-looking citations, complete with fake authors, journals, and DOIs
- Recent events: Anything past the model's training cutoff is a hallucination risk
- Specific statistics: "What percentage of X?" — models often fabricate precise-sounding numbers
- Less-covered topics: Niche subjects, local information, smaller organizations — models have less training data to draw from and guess more
- Legal and medical specifics: Specific laws, drug interactions, dosages — where being wrong has serious consequences
Tasks with Low Hallucination Risk
- Well-documented facts: Historical events, well-known science, mathematical operations
- Summarization of provided text: When the model works from content you give it, hallucination risk drops substantially
- Code generation: Hallucinated syntax or APIs usually surface as errors when the code runs, so mistakes are easier to catch than unverified factual claims
- Structure and formatting tasks: Reformatting data, creating templates — low hallucination risk because there are no factual claims
Reasoning Models and Hallucination
Reasoning models (o3, DeepSeek R1, Claude Extended Thinking) show a mixed hallucination profile. They hallucinate less on complex reasoning tasks because the chain-of-thought process catches errors before the final answer. However, they can confidently pursue incorrect reasoning paths — sometimes with more detailed wrong justifications than standard models would provide.
For factual recall tasks (looking up specific facts), reasoning models are not meaningfully better than standard models at preventing hallucination.
The Citation Hallucination Problem
Citation hallucination deserves special attention because it's both common and consequential. In testing, hallucination rates for requests for specific academic papers remain surprisingly high even in frontier models:
| Model | Citation Hallucination Rate |
|---|---|
| GPT-5 | ~5–10% |
| Claude 3.5 Sonnet | ~8–12% |
| GPT-4o | ~12–18% |
| Older models (GPT-3.5 class) | ~30–50% |
The core problem: models learned from text that discusses papers, so they know how citations should look and sound. When they don't have access to the actual paper, they generate a plausible-looking citation rather than admitting they don't know.
Rule: Never use AI-generated citations without verifying each one in the actual database (Google Scholar, PubMed, etc.).
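As a concrete illustration, a fabricated citation can often be caught with a quick lookup against Crossref's public API. The helper functions below are a minimal sketch; the function names and error handling are illustrative, not a complete verification workflow:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI is registered in Crossref (a fabricated DOI usually returns 404)."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

def search_title(title: str, rows: int = 3) -> list[str]:
    """Search Crossref for works matching a title and return the closest candidate titles."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [item["title"][0] for item in items if item.get("title")]

# Usage: check the DOI first, then confirm the title actually matches a real paper.
# doi_exists("10.1000/definitely-not-real")   -> likely False
# search_title("Attention Is All You Need")   -> real candidate titles to compare against
```

Note that a DOI which resolves but points to a different paper than the one cited is still a hallucination, so compare the returned title and authors, not just the status code.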
Techniques to Reduce Hallucination
Retrieval-Augmented Generation (RAG)
The most effective structural intervention: provide the model with retrieved source documents and instruct it to answer only from those documents. When a model is grounding answers in content you gave it rather than memory, hallucination rates drop dramatically — often to near zero for factual questions within the provided context.
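A minimal sketch of this pattern, assuming the OpenAI Python SDK and a placeholder model name (the retrieval step itself is outside the snippet; the documents are whatever your search layer returned):

```python
from openai import OpenAI

client = OpenAI()

def answer_from_documents(question: str, documents: list[str]) -> str:
    """Answer a question using only the supplied source documents."""
    context = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents))
    instructions = (
        "Answer using ONLY the documents provided. "
        "If the documents do not contain the answer, say so instead of guessing. "
        "Cite the document number for each claim."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model works here
        temperature=0.2,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```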
Ask for Confidence and Uncertainty
Prompt the model to indicate its confidence: "If you're not certain about a fact, say so rather than guessing." Models trained with RLHF respond to this instruction and will express uncertainty more readily when prompted. This doesn't eliminate hallucination, but it surfaces uncertainty so you know what to check.
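In practice the instruction just lives in the system prompt. The wording below is an illustration, not a benchmarked prompt:

```python
# Illustrative message list; the exact system-prompt wording is an assumption, not a tested recipe.
messages = [
    {
        "role": "system",
        "content": (
            "You are a careful research assistant. If you are not certain about a fact, "
            "say so explicitly rather than guessing, and mark uncertain claims with '[unverified]'."
        ),
    },
    {"role": "user", "content": "When was the first commercial lithium-ion battery released, and by whom?"},
]
```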
Verification Prompts
After receiving a factual answer, ask: "How confident are you in each of the specific facts you stated? Which ones should I verify independently?" Models will often correctly identify which claims are uncertain.
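A sketch of that two-turn flow with an OpenAI-style chat API (the model name and prompt wording are placeholders):

```python
from openai import OpenAI

client = OpenAI()

VERIFY_PROMPT = (
    "How confident are you in each of the specific facts you stated? "
    "List each claim with a confidence level (high / medium / low) and "
    "flag the ones I should verify independently."
)

def ask_then_verify(question: str, model: str = "gpt-4o") -> tuple[str, str]:
    """Ask a factual question, then ask the model to rate its own claims."""
    messages = [{"role": "user", "content": question}]
    answer = client.chat.completions.create(model=model, messages=messages)
    answer_text = answer.choices[0].message.content
    # Append the answer and the verification follow-up as a second turn.
    messages += [
        {"role": "assistant", "content": answer_text},
        {"role": "user", "content": VERIFY_PROMPT},
    ]
    review = client.chat.completions.create(model=model, messages=messages)
    return answer_text, review.choices[0].message.content
```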
Use Web Search Integration
Models with live web search integration (Perplexity, ChatGPT with search, Gemini with search) have substantially lower hallucination rates on current events and verifiable facts. The search tool grounds answers in actual sources.
Temperature and Sampling
Higher temperature settings increase creative output variability — and can increase hallucination on factual tasks. For tasks requiring factual accuracy, use lower temperatures (0.1–0.3) rather than defaults (0.7–1.0).
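With the OpenAI Python SDK, for example, temperature is a per-request parameter; most other providers expose an equivalent setting (the model name below is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Low temperature for factual lookups: less sampling variability, fewer confident guesses.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    temperature=0.2,
    messages=[{"role": "user", "content": "List the noble gases in order of atomic number."}],
)
print(response.choices[0].message.content)
```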
How to Verify AI-Generated Claims
- Statistics and data: Search for the original study or dataset, not just other pages repeating the number
- Citations: Check DOI or search title directly in Google Scholar
- Quotes: Search the exact quote to find the original context
- Recent events: Search for news coverage from the claimed date
- Expert claims: Search the person's name with the claim to verify attribution
Frequently Asked Questions
Will AI hallucination ever be solved?
Current evidence suggests hallucination can be significantly reduced but may be fundamentally difficult to eliminate from probabilistic language models. RAG, better training data, and improved RLHF continue reducing rates. Whether it reaches functionally zero is unclear. For now, human verification of high-stakes claims remains necessary.
Are smaller models worse at hallucination?
Generally yes — smaller models have less knowledge and are more likely to fill gaps with plausible-sounding guesses. But model size isn't the only factor; training methodology and RLHF significantly affect hallucination rates. Some smaller, better-trained models outperform larger poorly-trained ones.
Which AI tool has the lowest hallucination rate for research?
Perplexity.ai (with web search enabled) has the lowest effective hallucination rate for research tasks because it retrieves sources rather than relying purely on memory. Among pure language models, GPT-5 and Claude 4 Opus have the lowest hallucination rates on general factual tasks in current evaluations.