# AI Hallucinations Are Mathematically Inevitable. We're Using AI to Decide What to Bomb.
Two stories are running in parallel this week. Most coverage is treating them separately. They’re the same story.
Story one: Researchers from OpenAI and Georgia Tech published a paper proving that language model hallucinations are not a bug to be engineered away — they’re a structural consequence of how models are trained and evaluated. Mathematically, models that guess score better than models that say “I don’t know,” so every lab in the world is building models that guess.
Story two: The US military is using Palantir’s Maven Smart System — with Claude embedded — to determine which sites in Iran to bomb and analyze the results of those strikes. Simultaneously, the Pentagon blacklisted Anthropic for refusing to remove guardrails against autonomous weapons. Anthropic is suing.
Put these together: we are deploying AI systems whose hallucination rate is mathematically guaranteed to increase as they get more capable, in life-and-death decision contexts, while the company that built the AI is in court arguing that its own models are not reliable enough to make targeting decisions without human oversight.
## The paper
The September 2025 paper — “Why Language Models Hallucinate” — was authored by Adam Tauman Kalai, Edwin Zhang, and Ofir Nachum from OpenAI, with Georgia Tech’s Santosh Vempala. OpenAI published a summary; the full paper is on arXiv.
The core argument: hallucinations persist because benchmark scoring rewards guessing.
Think about it like a multiple-choice exam. A student who doesn’t know the answer but guesses has a nonzero chance of being right. Leaving it blank guarantees zero. Models trained and evaluated on accuracy — percentage of questions answered correctly — face the same incentive structure. A model that guesses “September 10” as someone’s birthday has a 1-in-365 chance of being right. A model that says “I don’t know” scores zero. Over thousands of questions, the guessing model looks better on every leaderboard.
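The incentive gap is easy to quantify. A toy sketch (assumed numbers: 10,000 birthday-style questions the model genuinely cannot answer, each with a uniform 1-in-365 chance of a lucky guess) compares the expected leaderboard score of a model that always guesses with one that always abstains:

```python
# Toy model of benchmark scoring that rewards guessing over abstention.
# Assumed setup: 10,000 unanswerable birthday-style questions, each with
# a 1-in-365 chance that a blind guess happens to be correct.

N_QUESTIONS = 10_000
P_LUCKY_GUESS = 1 / 365

# Accuracy-only scoring: a correct answer scores 1, anything else scores 0.
expected_score_guesser = N_QUESTIONS * P_LUCKY_GUESS   # ~27 points from luck alone
expected_score_abstainer = N_QUESTIONS * 0             # exactly 0

print(f"always guess:   {expected_score_guesser:.1f}")
print(f"always abstain: {expected_score_abstainer:.1f}")
# Under accuracy-only scoring, the guesser beats the abstainer on every
# lucky hit, and never pays for a miss -- so leaderboards prefer guessers.
```

The scoring rule is the illustrative point: as long as a wrong answer costs nothing more than a blank, guessing strictly dominates.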
The paper identifies three categories of unavoidable hallucination:
- Knowledge gaps — topics underrepresented in training data, where the model fills blanks
- Architectural limits — certain reasoning problems that today’s architectures genuinely cannot solve
- Unanswerable questions — things no model will ever be able to verify, regardless of capability
## The capability paradox
OpenAI’s own data is the most damning part. Comparing models on factual recall:
| Model | Hallucination rate |
|---|---|
| o1 (earliest reasoning model) | 16% |
| Next generation | 33% |
| Most recent release | 48% |
Each generation got more capable. Each generation hallucinated more. The explanation: more capable models are better at constructing fluent, confident, plausible-sounding text — which means they’re better at confidently fabricating things that don’t exist.
The fix, per the paper, is a scoring system that treats abstention as better than a wrong answer. OpenAI’s own data shows what this looks like:
| Model | Abstention rate | Accuracy | Error rate |
|---|---|---|---|
| GPT-5-thinking-mini | 52% | 22% | 26% |
| o4-mini | 1% | 24% | 75% |
Nearly identical accuracy. o4-mini is wrong nearly three times as often because it almost never says “I don’t know.” GPT-5-thinking-mini says it doesn’t know more than half the time, and when it does answer, it’s right at a much higher rate. The benchmark would rank o4-mini higher. The benchmark is the problem.
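One way to see the ranking flip: re-score both rows of the table under a simple penalized metric (an illustrative rule, not the paper's exact formulation) where a correct answer is +1, an abstention is 0, and a wrong answer is -1.

```python
# Re-score the two models from the table under an abstention-aware metric.
# Scoring rule (illustrative assumption): correct = +1, abstain = 0, wrong = -1.

def penalized_score(accuracy: float, error_rate: float) -> float:
    """Expected per-question score; abstentions contribute zero."""
    return accuracy - error_rate

# Numbers taken directly from the table above.
gpt5_mini = penalized_score(accuracy=0.22, error_rate=0.26)  # -0.04
o4_mini = penalized_score(accuracy=0.24, error_rate=0.75)    # -0.51

print(f"GPT-5-thinking-mini: {gpt5_mini:+.2f}")
print(f"o4-mini:             {o4_mini:+.2f}")
```

Under plain accuracy, o4-mini wins 24% to 22%; under the penalized rule it loses badly, because nearly every question it doesn't know becomes a confident wrong answer instead of an abstention.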
## Claude in Iran
While this paper was circulating, The Washington Post reported that the US military is using Palantir’s Maven Smart System — with Claude embedded in it — to determine which sites in Iran to bomb and provide analysis of strike outcomes.
This is not a hypothetical. This is operational use of an LLM with a documented, mathematically explained, rising hallucination rate to support targeting decisions in an active conflict.
Anthropic’s response is notable. In a statement to the Department of Defense, they wrote:
“Even fully autonomous weapons (those that take humans out of the loop entirely and automate selecting and engaging targets) may prove critical for our national defense. But today, frontier AI systems are simply not reliable enough to power fully autonomous weapons. We will not knowingly provide a product that puts America’s warfighters and civilians at risk.”
This is Anthropic using the hallucination argument as a safety guardrail in real time. The Pentagon disagrees. The Pentagon blacklisted them. Anthropic sued.
Dario Amodei’s broader position — outlined in a January essay — is more nuanced than the headlines suggest: he argues democratic governments should have access to the most advanced AI to counter autocratic adversaries, including for military use. His line is drawn at autonomous weapons specifically, and at the current reliability of frontier models. The hallucination paper is, in a sense, his evidence.
## The contradiction at the center of the industry
Here is the contradiction that the industry has not resolved:
Every lab publishes safety research documenting the reliability limits of their models. Every lab also deploys those models in increasingly high-stakes contexts. OpenAI publishes a paper proving hallucinations are structurally incentivized, then ships models embedded in enterprise and government workflows where a confident wrong answer has real consequences.
Anthropic refuses to enable autonomous weapons targeting but is already embedded in the system doing human-supervised targeting. The line between “human in the loop” and “human rubber-stamping AI output too fast to meaningfully verify” is not a bright one in an active conflict.
## What this means for builders
If you are building AI systems for any high-stakes application — medical, legal, financial, operational — the paper has a concrete design implication:
“I don’t know” must be a first-class output, not a failure mode.
Systems that penalize abstention — that force an answer when the model is uncertain — are building the hallucination problem in by design. Systems that reward abstention, that treat “I’m not confident enough to answer” as a valid and useful response, get dramatically lower error rates with comparable accuracy.
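What first-class abstention can look like in practice: a thin wrapper that turns a low-confidence generation into an explicit, typed abstain result instead of forcing an answer. This is a hypothetical sketch; the `model` callable and its confidence score are stand-ins for whatever your stack actually provides.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Answer:
    text: Optional[str]   # None means the system abstained
    confidence: float
    abstained: bool

def answer_or_abstain(
    model: Callable[[str], tuple],  # hypothetical: returns (text, confidence)
    prompt: str,
    threshold: float = 0.8,
) -> Answer:
    """Abstain explicitly when confidence falls below the threshold.

    Downstream code sees abstention as a valid, structured outcome --
    not an exception, not an empty string, not a forced guess.
    """
    text, confidence = model(prompt)
    if confidence < threshold:
        return Answer(text=None, confidence=confidence, abstained=True)
    return Answer(text=text, confidence=confidence, abstained=False)

# Usage with a stubbed model that is unsure of its answer:
unsure_model = lambda prompt: ("September 10", 0.3)
result = answer_or_abstain(unsure_model, "When is this person's birthday?")
print(result.abstained)  # True: low confidence becomes abstention, not a guess
```

The design choice that matters is the return type: abstention is a field on the result, so every consumer is forced to handle it rather than treat every output as an answer.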
For enterprise AI deployment specifically:
- Design verification workflows assuming some outputs are wrong, not assuming they’re right
- Never surface a single AI output as authoritative without a confidence signal
- Treat domains with sparse training coverage (novel legal situations, rare medical conditions, recent events) as high-hallucination-risk zones
- Build abstention into your evaluation metrics — a system that declines to answer 20% of the time but is right 95% when it does answer is more valuable than one that always answers and is right 70% of the time
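The last bullet's arithmetic, worked through under the same illustrative penalized rule as above (correct = +1, abstain = 0, wrong = -1):

```python
# System A: answers 80% of the time, right 95% of the time when it answers.
# System B: always answers, right 70% of the time.

def expected_score(answer_rate: float, precision: float) -> float:
    """Expected per-question score; abstentions contribute zero."""
    correct = answer_rate * precision
    wrong = answer_rate * (1 - precision)
    return correct - wrong

system_a = expected_score(answer_rate=0.80, precision=0.95)  # 0.76 - 0.04 = 0.72
system_b = expected_score(answer_rate=1.00, precision=0.70)  # 0.70 - 0.30 = 0.40

print(f"A (abstains 20%):   {system_a:.2f}")
print(f"B (always answers): {system_b:.2f}")
```

System A answers fewer questions yet scores nearly twice as high, because the 30% of questions System B gets confidently wrong are worse than silence.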
The math is settled. Hallucinations are not going away. The only variable is whether systems are designed to fail gracefully or to fail confidently.
Related: McKinsey’s Lilli Got Hacked in 2 Hours · AGENTS.md Is an Attack Surface · Crust — Security Gateway for AI Agents
Sources: OpenAI — Why Language Models Hallucinate · arXiv paper · Reuters — Anthropic sues Pentagon · The Guardian — Claude in Iran · Anthropic statement