# AI Hallucinations Are Mathematically Inevitable. We're Using AI to Decide What to Bomb.
Two stories are running in parallel this week. Most coverage is treating them separately. They’re the same story.
Story one: Researchers from OpenAI and Georgia Tech published a paper proving that language model hallucinations are not a bug to be engineered away — they’re a structural consequence of how models are trained and evaluated. Mathematically, models that guess score better than models that say “I don’t know,” so every lab in the world is building models that guess.
Story two: The US military is using Palantir’s Maven Smart System — with Claude embedded — to determine which sites in Iran to bomb and analyze the results of those strikes. Simultaneously, the Pentagon blacklisted Anthropic for refusing to remove guardrails against autonomous weapons. Anthropic is suing.
Put these together: we are deploying AI systems whose hallucination rate is mathematically guaranteed to increase as they get more capable, in life-and-death decision contexts, while the company that built the AI is in court arguing that its own models are not reliable enough to make targeting decisions without human oversight.
## The paper
The September 2025 paper — “Why Language Models Hallucinate” — was authored by Adam Tauman Kalai, Edwin Zhang, and Ofir Nachum from OpenAI, with Georgia Tech’s Santosh Vempala. OpenAI published a summary; the full paper is on arXiv.
The core argument: hallucinations persist because benchmark scoring rewards guessing.
Think about it like a multiple-choice exam. A student who doesn’t know the answer but guesses has a nonzero chance of being right. Leaving it blank guarantees zero. Models trained and evaluated on accuracy — percentage of questions answered correctly — face the same incentive structure. A model that guesses “September 10” as someone’s birthday has a 1-in-365 chance of being right. A model that says “I don’t know” scores zero. Over thousands of questions, the guessing model looks better on every leaderboard.
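The incentive gap is easy to quantify. A toy sketch (assumed numbers: 10,000 birthday-style questions the model genuinely cannot answer, each with a uniform 1-in-365 chance of a lucky guess) compares the expected leaderboard score of a model that always guesses with one that always abstains:

```python
# Toy model of benchmark scoring that rewards guessing over abstention.
# Assumed setup: 10,000 unanswerable birthday-style questions, each with
# a 1-in-365 chance that a blind guess happens to be correct.

N_QUESTIONS = 10_000
P_LUCKY_GUESS = 1 / 365

# Accuracy-only scoring: a correct answer scores 1, anything else scores 0.
expected_score_guesser = N_QUESTIONS * P_LUCKY_GUESS   # ~27 points from luck alone
expected_score_abstainer = N_QUESTIONS * 0             # exactly 0

print(f"always guess:   {expected_score_guesser:.1f}")
print(f"always abstain: {expected_score_abstainer:.1f}")
# Under accuracy-only scoring, the guesser beats the abstainer on every
# lucky hit, and never pays for a miss -- so leaderboards prefer guessers.
```

The scoring rule is the illustrative point: as long as a wrong answer costs nothing more than a blank, guessing strictly dominates.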
The paper identifies three categories of unavoidable hallucination:
- Knowledge gaps — topics underrepresented in training data, where the model fills blanks
- Architectural limits — certain reasoning problems that today’s architectures genuinely cannot solve
- Unanswerable questions — things no model will ever be able to verify, regardless of capability
## The capability paradox
OpenAI’s own data is the most damning part. Comparing models on factual recall:
| Model | Hallucination rate |
|---|---|
| o1 (earliest reasoning model) | 16% |
| Next generation | 33% |
| Most recent release | 48% |
Each generation got more capable. Each generation hallucinated more. The explanation: more capable models are better at constructing fluent, confident, plausible-sounding text — which means they’re better at confidently fabricating things that don’t exist.
The fix, per the paper, is a scoring system that treats abstention as better than a wrong answer. OpenAI’s own data shows what this looks like:
| Model | Abstention rate | Accuracy | Error rate |
|---|---|---|---|
| GPT-5-thinking-mini | 52% | 22% | 26% |
| o4-mini | 1% | 24% | 75% |
Nearly identical accuracy. o4-mini is wrong nearly three times as often because it almost never says “I don’t know.” GPT-5-thinking-mini says it doesn’t know more than half the time, and when it does answer, it’s right at a much higher rate. The benchmark would rank o4-mini higher. The benchmark is the problem.
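One way to see the ranking flip: re-score both rows of the table under a simple penalized metric (an illustrative rule, not the paper's exact formulation) where a correct answer is +1, an abstention is 0, and a wrong answer is -1.

```python
# Re-score the two models from the table under an abstention-aware metric.
# Scoring rule (illustrative assumption): correct = +1, abstain = 0, wrong = -1.

def penalized_score(accuracy: float, error_rate: float) -> float:
    """Expected per-question score; abstentions contribute zero."""
    return accuracy - error_rate

# Numbers taken directly from the table above.
gpt5_mini = penalized_score(accuracy=0.22, error_rate=0.26)  # -0.04
o4_mini = penalized_score(accuracy=0.24, error_rate=0.75)    # -0.51

print(f"GPT-5-thinking-mini: {gpt5_mini:+.2f}")
print(f"o4-mini:             {o4_mini:+.2f}")
```

Under plain accuracy, o4-mini wins 24% to 22%; under the penalized rule it loses badly, because nearly every question it doesn't know becomes a confident wrong answer instead of an abstention.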
## Claude in Iran
While this paper was circulating, The Washington Post reported that the US military is using Palantir’s Maven Smart System — with Claude embedded in it — to determine which sites in Iran to bomb and provide analysis of strike outcomes.
This is not a hypothetical. This is operational use of an LLM with a documented, mathematically explained, rising hallucination rate to support targeting decisions in an active conflict.
Anthropic’s response is notable. In a statement to the Department of Defense, they wrote:
“Even fully autonomous weapons (those that take humans out of the loop entirely and automate selecting and engaging targets) may prove critical for our national defense. But today, frontier AI systems are simply not reliable enough to power fully autonomous weapons. We will not knowingly provide a product that puts America’s warfighters and civilians at risk.”
This is Anthropic using the hallucination argument as a safety guardrail in real time. The Pentagon disagrees. The Pentagon blacklisted them. Anthropic sued.
Dario Amodei’s broader position — outlined in a January essay — is more nuanced than the headlines suggest: he argues democratic governments should have access to the most advanced AI to counter autocratic adversaries, including for military use. His line is drawn at autonomous weapons specifically, and at the current reliability of frontier models. The hallucination paper is, in a sense, his evidence.
## The contradiction at the center of the industry
Here is the contradiction that the industry has not resolved:
Every lab publishes safety research documenting the reliability limits of their models. Every lab also deploys those models in increasingly high-stakes contexts. OpenAI publishes a paper proving hallucinations are structurally incentivized, then ships models embedded in enterprise and government workflows where a confident wrong answer has real consequences.
Anthropic refuses to enable autonomous weapons targeting but is already embedded in the system doing human-supervised targeting. The line between “human in the loop” and “human rubber-stamping AI output too fast to meaningfully verify” is not a bright one in an active conflict.
## What this means for builders
If you are building AI systems for any high-stakes application — medical, legal, financial, operational — the paper has a concrete design implication:
“I don’t know” must be a first-class output, not a failure mode.
Systems that penalize abstention — that force an answer when the model is uncertain — are building the hallucination problem in by design. Systems that reward abstention, that treat “I’m not confident enough to answer” as a valid and useful response, get dramatically lower error rates with comparable accuracy.
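What first-class abstention can look like in practice: a thin wrapper that turns a low-confidence generation into an explicit, typed abstain result instead of forcing an answer. This is a hypothetical sketch; the `model` callable and its confidence score are stand-ins for whatever your stack actually provides.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Answer:
    text: Optional[str]   # None means the system abstained
    confidence: float
    abstained: bool

def answer_or_abstain(
    model: Callable[[str], tuple],  # hypothetical: returns (text, confidence)
    prompt: str,
    threshold: float = 0.8,
) -> Answer:
    """Abstain explicitly when confidence falls below the threshold.

    Downstream code sees abstention as a valid, structured outcome --
    not an exception, not an empty string, not a forced guess.
    """
    text, confidence = model(prompt)
    if confidence < threshold:
        return Answer(text=None, confidence=confidence, abstained=True)
    return Answer(text=text, confidence=confidence, abstained=False)

# Usage with a stubbed model that is unsure of its answer:
unsure_model = lambda prompt: ("September 10", 0.3)
result = answer_or_abstain(unsure_model, "When is this person's birthday?")
print(result.abstained)  # True: low confidence becomes abstention, not a guess
```

The design choice that matters is the return type: abstention is a field on the result, so every consumer is forced to handle it rather than treat every output as an answer.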
For enterprise AI deployment specifically:
- Design verification workflows assuming some outputs are wrong, not assuming they’re right
- Never surface a single AI output as authoritative without a confidence signal
- Treat domains with sparse training coverage (novel legal situations, rare medical conditions, recent events) as high-hallucination-risk zones
- Build abstention into your evaluation metrics — a system that declines to answer 20% of the time but is right 95% when it does answer is more valuable than one that always answers and is right 70% of the time
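The last bullet's arithmetic, worked through under the same illustrative penalized rule as above (correct = +1, abstain = 0, wrong = -1):

```python
# System A: answers 80% of the time, right 95% of the time when it answers.
# System B: always answers, right 70% of the time.

def expected_score(answer_rate: float, precision: float) -> float:
    """Expected per-question score; abstentions contribute zero."""
    correct = answer_rate * precision
    wrong = answer_rate * (1 - precision)
    return correct - wrong

system_a = expected_score(answer_rate=0.80, precision=0.95)  # 0.76 - 0.04 = 0.72
system_b = expected_score(answer_rate=1.00, precision=0.70)  # 0.70 - 0.30 = 0.40

print(f"A (abstains 20%):   {system_a:.2f}")
print(f"B (always answers): {system_b:.2f}")
```

System A answers fewer questions yet scores nearly twice as high, because the 30% of questions System B gets confidently wrong are worse than silence.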
The math is settled. Hallucinations are not going away. The only variable is whether systems are designed to fail gracefully or to fail confidently.
Related: McKinsey’s Lilli Got Hacked in 2 Hours · AGENTS.md Is an Attack Surface · Crust — Security Gateway for AI Agents
Sources: OpenAI — Why Language Models Hallucinate · arXiv paper · Reuters — Anthropic sues Pentagon · The Guardian — Claude in Iran · Anthropic statement