HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
Quick Take
HalluWorld is a controlled benchmark that measures hallucination in language models against fully specified reference worlds, with hallucination labels generated automatically.
Key Points
- HalluWorld defines hallucination through explicit reference worlds.
- Synthetic environments allow controlled evaluation of model performance.
- Failures are uneven across probe types and domains, suggesting distinct failure modes rather than a single capability.
Abstract: Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.
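The reference-world definition in the abstract can be illustrated with a toy gridworld. The following is a minimal sketch, not the HalluWorld implementation: the names `Claim`, `visible_cells`, and `label_claim`, the square field of view, and the label strings are all hypothetical choices made for illustration. It shows how a fully specified world plus a controlled view lets labels be computed automatically, and how observed and unobserved errors can be separated.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Cell = Tuple[int, int]


@dataclass(frozen=True)
class Claim:
    """A model's claim about one cell; content=None means it abstained."""
    cell: Cell
    content: Optional[str]


def visible_cells(agent: Cell, radius: int, width: int, height: int) -> Set[Cell]:
    """Cells inside a square window around the agent (assumed view model)."""
    ax, ay = agent
    return {
        (x, y)
        for x in range(max(0, ax - radius), min(width, ax + radius + 1))
        for y in range(max(0, ay - radius), min(height, ay + radius + 1))
    }


def label_claim(world: Dict[Cell, str], view: Set[Cell], claim: Claim) -> str:
    """Label a claim against the reference world (hypothetical label names)."""
    observed = claim.cell in view
    if claim.content is None:
        return "abstained_observed" if observed else "abstained_unobserved"
    truth = world.get(claim.cell, "empty")  # unlisted cells are empty
    if claim.content == truth:
        return "correct"
    # A false claim is a hallucination; record whether the cell was in view.
    return "hallucination_observed" if observed else "hallucination_unobserved"


# Reference world: a 4x4 grid with two objects; the agent sees a 3x3 window.
world = {(0, 0): "key", (3, 3): "door"}
view = visible_cells(agent=(1, 1), radius=1, width=4, height=4)

print(label_claim(world, view, Claim((0, 0), "door")))   # false claim, cell in view
print(label_claim(world, view, Claim((3, 3), "empty")))  # false claim, cell out of view
print(label_claim(world, view, Claim((3, 3), None)))     # abstention, cell out of view
```

In this sketch a correct guess about an unobserved cell is still labeled `correct`; a benchmark that wants to reward abstention on unobserved state would need a separate scoring rule for that case.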
| Comments: | HalluWorld benchmark (code and data) at this http URL |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML) |
| Cite as: | arXiv:2605.19341 [cs.CL] (or arXiv:2605.19341v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.19341 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Steven Y. Feng
[v1] Tue, 19 May 2026 04:29:03 UTC (10,992 KB)
Originally published at arxiv.org