Hallucination Is a Calibration Failure

The dominant explanation for AI hallucination — that models confabulate because they do not "know" the answer — is incomplete. It locates the problem in a knowledge deficit that better training data would solve. The literature suggests a different diagnosis: hallucination is a structural consequence of how language models are trained, compounded by the absence of any mechanism to separate confidence from accuracy after generation.[1]

Three failure modes account for the bulk of observable hallucination. Each has a different cause, a different location in the inference pipeline, and a different class of potential fix.

Training Objective Misalignment

Language models are trained to predict the next token by minimizing cross-entropy loss against a data distribution. This objective scores only the probability assigned to the observed token; nothing in it rewards abstaining or flagging uncertainty. A model that guesses fluently when uncertain therefore outperforms a model that abstains, because abstention is never rewarded.

Systematic Overconfidence

When confidence is elicited from LLMs, it is poorly calibrated in a specific direction: overconfidence. More critically, unlike humans, LLMs do not adjust confidence based on past performance. The calibration error is structural, not incidental.

No External Truth Check

In standard deployment, the same model that generates a claim also implicitly evaluates it. There is no structural separation between the generation process and the verification process. The model cannot be surprised by its own output.

Failure Mode I

Training Objectives That Reward Guessing

Wu et al. (2025) provide the clearest formal account of why hallucination persists despite scale.[1] Standard reinforcement learning from human feedback uses binary reward signals: correct or incorrect. Under this scheme, a model rationally guesses whenever its estimated probability of correctness exceeds zero: a wrong answer and an abstention both earn nothing, so any positive chance of reward makes guessing the better policy. The training objective inadvertently optimizes for fluent guessing.
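The incentive can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names and the `wrong_penalty` parameter are assumptions introduced here, not part of Wu et al.'s method. It compares the expected reward of answering with the zero reward of abstaining, first under a binary scheme and then under a hypothetical scheme that charges for wrong answers.

```python
ABSTAIN_REWARD = 0.0  # abstaining earns nothing under either scheme


def expected_reward(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected reward for answering: +1 if right, -wrong_penalty if wrong.

    wrong_penalty = 0.0 reproduces the binary correct/incorrect scheme.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """A reward-maximizing policy answers only if that beats abstaining."""
    return expected_reward(p_correct, wrong_penalty) > ABSTAIN_REWARD


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which answering and abstaining have equal expected reward."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    for p in (0.01, 0.3, 0.6):
        print(p, should_answer(p), should_answer(p, wrong_penalty=1.0))
```

With no penalty, the break-even confidence is zero, so even a 1% chance of being right justifies a guess. Any positive penalty moves the threshold above zero, which is the general shape of the correction discussed next.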

Core finding — Wu et al. (2025)

Hallucination is not merely stochastic error but a predictable statistical consequence of training objectives that prioritize mimicking data distribution over epistemic honesty.

The proposed correction, training on strictly proper scoring rules that reward calibrated abstention, produced measurable results: a 4B model trained this way surpassed GPT-5 on uncertainty quantification benchmarks and achieved zero-shot calibration error on par with frontier models on factual QA, despite substantially lower raw factual accuracy.[1] This result isolates calibration as a skill separable from knowledge, a distinction that matters for any intervention applied post-generation.
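What makes a scoring rule "strictly proper" is that the expected score is optimized only by reporting one's true probability. The sketch below uses the Brier score, a standard strictly proper rule, as an example; the source does not say which rule Wu et al. use, so treat the choice as an assumption. The function names are hypothetical.

```python
def expected_brier(report: float, true_p: float) -> float:
    """Expected Brier loss for reporting `report` when the claim is true
    with probability `true_p`. Lower is better."""
    loss_if_true = (1.0 - report) ** 2
    loss_if_false = report ** 2
    return true_p * loss_if_true + (1.0 - true_p) * loss_if_false


def best_report(true_p: float, grid_size: int = 101) -> float:
    """Search a grid of reported confidences for the one with lowest expected loss."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda q: expected_brier(q, true_p))


if __name__ == "__main__":
    # A model that is right 30% of the time does best by saying 0.3,
    # not by claiming certainty.
    print(best_report(0.3))
    print(expected_brier(1.0, 0.3), expected_brier(0.3, 0.3))
```

Under a rule like this, overclaiming is penalized in expectation, which is why calibration can be trained separately from raw accuracy.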

Failure Mode II

Overconfidence Without Performance Feedback

Cash et al. (2025) measured the accuracy of LLM confidence judgments across five task domains: NFL predictions, Oscar predictions, Pictionary, trivia, and domain-specific knowledge questions.[2] LLMs achieved similar absolute and relative metacognitive accuracy to humans. They were also similarly overconfident. The critical asymmetry: humans adjust confidence based on past performance feedback; LLMs do not. Confidence is generated token-by-token in the same pass as content, so there is no feedback loop from prior errors.

Lin et al. (2022) demonstrated that calibrated uncertainty expression is learnable.[3] GPT-3, when explicitly trained to output verbalized probabilities alongside answers, produced well-calibrated confidence estimates that generalized under distribution shift. The implication is that default model outputs are not calibrated — calibration requires either additional training or an external calibration step applied after generation.
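Calibration in this sense is measurable. A common summary statistic is expected calibration error (ECE), which bins predictions by stated confidence and compares each bin's average confidence with its observed accuracy. The following is a minimal sketch with equal-width bins; the function name and bin count are choices made here, not taken from the cited papers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per answer
    correct:     booleans, True where the answer was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Ten answers, all stated at 90% confidence, half of them wrong.
    print(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5))
```

An overconfident model shows up as a large positive gap between confidence and accuracy in the high-confidence bins.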

Failure Mode III

Semantic Entropy and the Absent Verifier

Farquhar et al. (2024) published the most widely cited technical approach to hallucination detection, appearing in Nature with over 1,100 citations.[4] Their central observation: the same model that generates a claim is the only available source for evaluating it, creating a closed loop that cannot surface confabulations from the inside. Semantic entropy — measuring uncertainty at the level of meaning rather than token sequences — breaks this loop statistically without requiring external knowledge.

Core finding — Farquhar et al. (2024)

Computing uncertainty over meaning rather than token sequences detects confabulations robustly across tasks, without task-specific data or a priori knowledge of the domain.

Semantic entropy is effective but detects a specific subclass of hallucination: confabulations produced under high uncertainty. It does not detect plausible-sounding errors where the model generates with high confidence. This failure mode — the fluent, confident wrong answer — is precisely what training objective misalignment produces.
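The core computation can be sketched briefly: sample several answers to the same question, group them by meaning, and take the entropy over the groups. Farquhar et al. cluster answers using entailment between them; the sketch below swaps in plain string normalization as a hypothetical stand-in for that equivalence test, and estimates cluster probabilities from sample counts alone.

```python
import math


def normalize(answer: str) -> str:
    """Toy equivalence key: lowercase, trim punctuation, collapse whitespace.
    A stand-in for a real semantic-equivalence check (an assumption here)."""
    return " ".join(answer.lower().strip(" .!?").split())


def semantic_entropy(samples, same_meaning=None):
    """Entropy over meaning clusters of sampled answers (natural log)."""
    if same_meaning is None:
        same_meaning = lambda a, b: normalize(a) == normalize(b)

    clusters = []
    for sample in samples:
        for cluster in clusters:
            if same_meaning(sample, cluster[0]):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])  # no existing cluster matched

    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


if __name__ == "__main__":
    print(semantic_entropy(["Paris", "paris.", "Paris"]))         # one meaning
    print(semantic_entropy(["Paris", "Lyon", "Nice", "Lille"]))   # four meanings
```

The limitation described above is visible in the first case: if a model gives the same wrong answer on every sample, the entropy is zero and nothing is flagged.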

Post-Generation Verification

When External Verification Outperforms Internal Detection

Zhou (2025) surveyed the full intervention landscape and reached a practical conclusion: post-hoc verification is the appropriate strategy for long-form generation and documents that cite external sources.[5] Retrieval-augmented generation and chain-of-thought reasoning address multi-step errors during generation. Post-hoc methods — RARR, Chain-of-Verification, CRITIC, Reflexion — apply verification after the full output exists.

Boutra et al. (2025) demonstrated post-hoc knowledge graph verification improving factual accuracy consistently across two biomedical datasets, without retraining the base model.[6] The mechanism: extract factual claims, reformulate as structured queries, validate against a curated knowledge source, flag or revise unsupported claims. The framework is lightweight and generalizable.
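The validate-and-flag steps of such a pipeline can be illustrated with a toy example. This is a sketch under strong assumptions, not Boutra et al.'s implementation: claims are assumed to be already extracted as subject-relation-object triples, and the knowledge source is a small in-memory set standing in for a curated knowledge graph. All names are hypothetical.

```python
from typing import NamedTuple


class Claim(NamedTuple):
    subject: str
    relation: str
    obj: str


# Toy stand-in for a curated knowledge source.
KNOWLEDGE_GRAPH = {
    ("metformin", "treats", "type 2 diabetes"),
    ("warfarin", "interacts_with", "aspirin"),
}


def verify_claims(claims, graph=KNOWLEDGE_GRAPH):
    """Split claims into those found in the graph and those that are not.

    "Unsupported" means absent from the source, not proven false.
    """
    supported, unsupported = [], []
    for claim in claims:
        key = tuple(part.lower() for part in claim)
        if key in graph:
            supported.append(claim)
        else:
            unsupported.append(claim)
    return supported, unsupported


if __name__ == "__main__":
    extracted = [
        Claim("Metformin", "treats", "type 2 diabetes"),
        Claim("Metformin", "treats", "hypertension"),
    ]
    ok, flagged = verify_claims(extracted)
    print("supported:", ok)
    print("flagged for review:", flagged)
```

Because the check runs on the finished output, the base model is untouched, which is what makes this class of method lightweight.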

Valentin et al. (2024) addressed the practical problem of deploying hallucination detection in production.[7] Individual scoring methods (semantic similarity, NLI-based scores, self-consistency) each perform best on different task types. A multi-scoring framework that calibrates and combines methods achieves consistent top performance across all datasets. Calibrating individual scores before thresholding is identified as critical for risk-aware decision-making.
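Why calibrate before combining? Raw scores from different detectors live on different scales, so a shared threshold means different things for each. The sketch below shows one simple way to put them on a common footing. It is an assumption-laden illustration: histogram binning and plain averaging are stand-ins chosen here, since the summary above does not specify which calibration or combination method Valentin et al. use.

```python
def fit_bin_calibrator(scores, labels, n_bins=5):
    """Learn a map from a raw score in [0, 1] to the observed hallucination
    rate of its bin. labels: 1 = hallucination, 0 = grounded."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for score, label in zip(scores, labels):
        idx = min(int(score * n_bins), n_bins - 1)
        counts[idx] += 1
        hits[idx] += int(label)

    base_rate = sum(hits) / max(sum(counts), 1)
    # Empty bins fall back to the overall hallucination rate.
    rates = [hits[i] / counts[i] if counts[i] else base_rate for i in range(n_bins)]

    def calibrate(score):
        return rates[min(int(score * n_bins), n_bins - 1)]

    return calibrate


def combined_risk(raw_scores, calibrators):
    """Average the calibrated risk estimates, one raw score per detector."""
    calibrated = [cal(score) for score, cal in zip(raw_scores, calibrators)]
    return sum(calibrated) / len(calibrated)


if __name__ == "__main__":
    # Detector A separates the classes; detector B's score is uninformative.
    cal_a = fit_bin_calibrator([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
    cal_b = fit_bin_calibrator([0.9, 0.9, 0.9, 0.9], [0, 0, 1, 1])
    print(combined_risk([0.9, 0.9], [cal_a, cal_b]))
```

After calibration, a raw 0.9 from an uninformative detector contributes only the base rate, so it no longer dominates the combined estimate.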

Mechanism Analysis

Where Bayesian Evidence Scoring Operates

BayesCore operates post-generation. It does not modify the model, retrain on calibration objectives, or retrieve external facts during generation. Its mechanism is: claim extraction → evidence grounding → Bayesian posterior aggregation. Each extracted claim receives a confidence weight proportional to the evidence supporting it. The document score is the sum of weighted confidences, normalized to a 100-point scale.
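One way to read the aggregation step is sketched below. This is not BayesCore's implementation, which the text does not specify; every name, the use of likelihood ratios, the 0.5 default prior, and the weighted-average normalization are assumptions made for illustration. Each claim starts from a prior, is updated by the evidence attached to it, and contributes its posterior to a weighted document score.

```python
from dataclasses import dataclass


@dataclass
class ScoredClaim:
    text: str
    # One ratio per evidence item: P(evidence | claim true) / P(evidence | claim false).
    likelihood_ratios: list
    prior: float = 0.5   # must lie strictly between 0 and 1
    weight: float = 1.0  # hypothetical importance weight


def posterior(claim: ScoredClaim) -> float:
    """Bayes update in odds form, treating evidence items as independent."""
    odds = claim.prior / (1.0 - claim.prior)
    for ratio in claim.likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


def document_score(claims) -> float:
    """Weighted mean of claim posteriors, scaled to 0-100."""
    total_weight = sum(c.weight for c in claims)
    if total_weight == 0:
        return 0.0
    weighted = sum(c.weight * posterior(c) for c in claims)
    return 100.0 * weighted / total_weight


if __name__ == "__main__":
    claims = [
        ScoredClaim("well-supported claim", likelihood_ratios=[4.0]),
        ScoredClaim("contradicted claim", likelihood_ratios=[0.25]),
        ScoredClaim("claim with no evidence", likelihood_ratios=[]),
    ]
    print(document_score(claims))
```

A claim with no evidence stays at its prior, so unsupported text pulls the score toward the middle rather than being counted as either grounded or false.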

This positions BayesCore against two of the three failure modes identified above.

| Failure Mode | Existing Interventions | BayesCore Mechanism |
| --- | --- | --- |
| I: Training misalignment | Behaviorally calibrated RL, proper scoring rules | Not applicable (post-generation only) |
| II: Overconfidence | Verbalized probability training, uncertainty-aware training | External calibration: document score is structurally calibrated by evidence weight, not by model self-report |
| III: No external check | Semantic entropy, RAG, CoVe, knowledge graph verification | Claim-level Bayesian scoring against evidence provides the external verifier the generation process cannot supply |

The structural difference between BayesCore and binary hallucination detectors is the output type. Semantic entropy and NLI-based detectors are typically thresholded into a flag: hallucination present or absent. BayesCore produces a calibrated posterior over the full document: a continuous score representing the aggregate evidence-grounding quality. This matters for long-form documents, where hallucinated and well-grounded claims coexist in the same output. A binary flag cannot distinguish a document that is 90% grounded from one that is 10% grounded. A weighted posterior can.

Posterior

What This Framework Predicts — and What It Requires

The theoretical prediction is specific: for long-form AI-generated documents where hallucinated and grounded claims are intermixed, claim-level Bayesian evidence scoring will produce better-calibrated risk estimates than binary hallucination detectors or semantic entropy applied to the document as a whole. The mechanism is sound — it directly addresses the absence of external verification (Failure Mode III) and replaces model-reported confidence with evidence-derived weights (Failure Mode II).

What this framework does not yet provide is empirical validation. The prediction requires testing against a hallucination benchmark — FActScore or SimpleQA — with BayesCore scores compared against semantic entropy baselines and BERT-F1 factuality metrics. That comparison is the next study.

The honest assessment: post-hoc Bayesian evidence scoring is a theoretically well-motivated intervention that the current literature supports but has not yet tested at the document level. Publishing this claim as a theoretical framework, with explicit acknowledgment of what empirical validation remains to be done, is the appropriate way to enter the research record.

References

[1] Wu, J. et al. (2025). Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning. ArXiv. 6 citations.

[2] Cash, T. N. et al. (2025). Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments. Memory & Cognition. 16 citations.

[3] Lin, S. C. et al. (2022). Teaching Models to Express Their Uncertainty in Words. Transactions on Machine Learning Research. 667 citations.

[4] Farquhar, S. et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature. ~1,173 citations.

[5] Zhou, B. (2025). Detection and Mitigation of Factual Hallucinations in Large Language Models: A Comparative Review. Applied and Computational Engineering. 0 citations.

[6] Boutra, I. et al. (2025). Post-hoc Verification of LLMs with Knowledge Graphs for Medical Question Answering. ICRAMI 2025. 0 citations.

[7] Valentin, S. et al. (2024). Cost-Effective Hallucination Detection for LLMs. ArXiv. 19 citations.

Citation counts retrieved May 2026 via Consensus. ~-prefixed counts are approximate. All papers independently accessible via DOI or arXiv. Claims in this paper correspond to the abstracts and findings sections of cited works. Full-text verification recommended before citing in academic contexts.
