AI hallucination — when models confidently state false information — varies significantly across models and task types. Understanding which models hallucinate most, on what types of tasks, and how to reduce hallucination risk is essential for any serious AI workflow.
What Is Hallucination?
Hallucination refers to AI models generating confident, plausible-sounding statements that are factually incorrect or fabricated. Common hallucination patterns include:
- Invented citations: Made-up academic papers, books, or sources that don't exist
- Plausible-but-false facts: Incorrect statistics, wrong dates, fabricated quotes
- False expertise: Confident claims in specialized domains that are technically wrong
- Entity confusion: Mixing up people with similar names, conflating events
- Outdated information presented as current: Not hallucination in the traditional sense, but a related accuracy failure
Hallucination Rates by Model
| Model | General Hallucination Rate | Citation Accuracy | Factual Recall |
|---|---|---|---|
| GPT-5 | ~3–5% | High | Very high |
| Claude 4 Opus | ~4–6% | High | Very high |
| Claude 3.5 Sonnet | ~5–8% | High | High |
| GPT-4o | ~6–10% | Moderate | High |
| Gemini 2.5 Ultra | ~5–8% | Moderate-High | High |
| DeepSeek V3 | ~7–12% | Moderate | Moderate-High |
| Mistral Large 2 | ~8–12% | Moderate | Moderate |
| Llama 4 Maverick | ~8–13% | Moderate | Moderate |
| GPT-3.5-turbo | ~15–20% | Low | Moderate |
Hallucination by Task Type
Tasks with High Hallucination Risk
- Citation requests: "Can you find papers about X?" — models routinely fabricate plausible-looking citations, complete with fake authors, journals, and DOIs
- Recent events: Anything past the model's training cutoff is a hallucination risk
- Specific statistics: "What percentage of X?" — models often fabricate precise-sounding numbers
- Less-covered topics: Niche subjects, local information, smaller organizations — models have less training data to draw from and guess more
- Legal and medical specifics: Specific laws, drug interactions, dosages — where being wrong has serious consequences
Tasks with Low Hallucination Risk
- Well-documented facts: Historical events, well-known science, mathematical operations
- Summarization of provided text: When the model works from content you give it, hallucination risk drops substantially
- Code generation: Hallucinated syntax or APIs usually surface as errors when the code runs, so mistakes are easier to catch than unverified factual claims
- Structure and formatting tasks: Reformatting data, creating templates — low hallucination risk because there are no factual claims
Reasoning Models and Hallucination
Reasoning models (o3, DeepSeek R1, Claude Extended Thinking) show a mixed hallucination profile. They hallucinate less on complex reasoning tasks because the chain-of-thought process catches errors before the final answer. However, they can confidently pursue incorrect reasoning paths — sometimes with more detailed wrong justifications than standard models would provide.
For factual recall tasks (looking up specific facts), reasoning models are not meaningfully better than standard models at preventing hallucination.
The Citation Hallucination Problem
Citation hallucination deserves special attention because it's both common and consequential. In testing, hallucination rates for requests for specific academic papers remain surprisingly high even in frontier models:
| Model | Citation Hallucination Rate |
|---|---|
| GPT-5 | ~5–10% |
| Claude 3.5 Sonnet | ~8–12% |
| GPT-4o | ~12–18% |
| Older models (GPT-3.5 class) | ~30–50% |
The core problem: models learned from text that discusses papers, so they know how citations should look and sound. When they don't have access to the actual paper, they generate a plausible-looking citation rather than admitting they don't know.
Rule: Never use AI-generated citations without verifying each one in the actual database (Google Scholar, PubMed, etc.).
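As a concrete illustration, a fabricated citation can often be caught with a quick lookup against Crossref's public API. The helper functions below are a minimal sketch; the function names and error handling are illustrative, not a complete verification workflow:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI is registered in Crossref (a fabricated DOI usually returns 404)."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

def search_title(title: str, rows: int = 3) -> list[str]:
    """Search Crossref for works matching a title and return the closest candidate titles."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [item["title"][0] for item in items if item.get("title")]

# Usage: check the DOI first, then confirm the title actually matches a real paper.
# doi_exists("10.1000/definitely-not-real")   -> likely False
# search_title("Attention Is All You Need")   -> real candidate titles to compare against
```

Note that a DOI which resolves but points to a different paper than the one cited is still a hallucination, so compare the returned title and authors, not just the status code.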
Techniques to Reduce Hallucination
Retrieval-Augmented Generation (RAG)
The most effective structural intervention: provide the model with retrieved source documents and instruct it to answer only from those documents. When a model is grounding answers in content you gave it rather than memory, hallucination rates drop dramatically — often to near zero for factual questions within the provided context.
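A minimal sketch of this pattern, assuming the OpenAI Python SDK and a placeholder model name (the retrieval step itself is outside the snippet; the documents are whatever your search layer returned):

```python
from openai import OpenAI

client = OpenAI()

def answer_from_documents(question: str, documents: list[str]) -> str:
    """Answer a question using only the supplied source documents."""
    context = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents))
    instructions = (
        "Answer using ONLY the documents provided. "
        "If the documents do not contain the answer, say so instead of guessing. "
        "Cite the document number for each claim."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model works here
        temperature=0.2,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```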
Ask for Confidence and Uncertainty
Prompt the model to indicate its confidence: "If you're not certain about a fact, say so rather than guessing." Models trained with RLHF respond to this instruction and will express uncertainty more readily when prompted. This doesn't eliminate hallucination, but it surfaces uncertainty so you know what to check.
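In practice the instruction just lives in the system prompt. The wording below is an illustration, not a benchmarked prompt:

```python
# Illustrative message list; the exact system-prompt wording is an assumption, not a tested recipe.
messages = [
    {
        "role": "system",
        "content": (
            "You are a careful research assistant. If you are not certain about a fact, "
            "say so explicitly rather than guessing, and mark uncertain claims with '[unverified]'."
        ),
    },
    {"role": "user", "content": "When was the first commercial lithium-ion battery released, and by whom?"},
]
```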
Verification Prompts
After receiving a factual answer, ask: "How confident are you in each of the specific facts you stated? Which ones should I verify independently?" Models will often correctly identify which claims are uncertain.
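A sketch of that two-turn flow with an OpenAI-style chat API (the model name and prompt wording are placeholders):

```python
from openai import OpenAI

client = OpenAI()

VERIFY_PROMPT = (
    "How confident are you in each of the specific facts you stated? "
    "List each claim with a confidence level (high / medium / low) and "
    "flag the ones I should verify independently."
)

def ask_then_verify(question: str, model: str = "gpt-4o") -> tuple[str, str]:
    """Ask a factual question, then ask the model to rate its own claims."""
    messages = [{"role": "user", "content": question}]
    answer = client.chat.completions.create(model=model, messages=messages)
    answer_text = answer.choices[0].message.content
    # Append the answer and the verification follow-up as a second turn.
    messages += [
        {"role": "assistant", "content": answer_text},
        {"role": "user", "content": VERIFY_PROMPT},
    ]
    review = client.chat.completions.create(model=model, messages=messages)
    return answer_text, review.choices[0].message.content
```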
Use Web Search Integration
Models with live web search integration (Perplexity, ChatGPT with search, Gemini with search) have substantially lower hallucination rates on current events and verifiable facts. The search tool grounds answers in actual sources.
Temperature and Sampling
Higher temperature settings increase creative output variability — and can increase hallucination on factual tasks. For tasks requiring factual accuracy, use lower temperatures (0.1–0.3) rather than defaults (0.7–1.0).
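With the OpenAI Python SDK, for example, temperature is a per-request parameter; most other providers expose an equivalent setting (the model name below is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Low temperature for factual lookups: less sampling variability, fewer confident guesses.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    temperature=0.2,
    messages=[{"role": "user", "content": "List the noble gases in order of atomic number."}],
)
print(response.choices[0].message.content)
```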
How to Verify AI-Generated Claims
- Statistics and data: Search for the original study or dataset, not just other pages repeating the number
- Citations: Check DOI or search title directly in Google Scholar
- Quotes: Search the exact quote to find the original context
- Recent events: Search for news coverage from the claimed date
- Expert claims: Search the person's name with the claim to verify attribution
Frequently Asked Questions
Will AI hallucination ever be solved?
Current evidence suggests hallucination can be significantly reduced but may be fundamentally difficult to eliminate from probabilistic language models. RAG, better training data, and improved RLHF continue reducing rates. Whether it reaches functionally zero is unclear. For now, human verification of high-stakes claims remains necessary.
Are smaller models worse at hallucination?
Generally yes — smaller models have less knowledge and are more likely to fill gaps with plausible-sounding guesses. But model size isn't the only factor; training methodology and RLHF significantly affect hallucination rates. Some smaller, better-trained models outperform larger poorly-trained ones.
Which AI tool has the lowest hallucination rate for research?
Perplexity.ai (with web search enabled) has the lowest effective hallucination rate for research tasks because it retrieves sources rather than relying purely on memory. Among pure language models, GPT-5 and Claude 4 Opus have the lowest hallucination rates on general factual tasks in current evaluations.