HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

arXiv cs.CL·Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

Quick Take

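HalluWorld is a controlled benchmark that defines hallucination against fully specified reference worlds, so that hallucination labels can be generated automatically and compared across gridworld, chess, and terminal tasks.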

Key Points

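HalluWorld defines a hallucination as an observable claim that is false with respect to an explicit reference world.
Synthetic and semi-synthetic environments fix both the reference world and the model's view of it, which allows controlled, automatically labeled evaluation.
Failures are uneven across probe types and domains, which suggests that hallucinations arise from distinct failure modes rather than a single capability.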

[Submitted on 19 May 2026]

Abstract: Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.
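The reference-world formulation in the abstract can be illustrated with a minimal sketch. This is not the HalluWorld code: the toy gridworld, the `Claim` type, the `label_claim` function, and the label names are all hypothetical, chosen only to show how a label can be computed automatically once the world and the model's view are both fully specified.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Cell = Tuple[int, int]


@dataclass(frozen=True)
class Claim:
    """A model's observable claim: 'cell contains obj' (obj=None means 'cell is empty')."""
    cell: Cell
    obj: Optional[str]


def label_claim(world: Dict[Cell, str], visible: Set[Cell], claim: Optional[Claim]) -> str:
    """Label a claim against a fully specified reference world.

    `world` maps occupied cells to object names (the ground truth).
    `visible` is the set of cells the model was shown (its controlled view).
    The label names are illustrative assumptions, not the paper's categories.
    """
    if claim is None:
        # The model declined to make a claim.
        return "abstained"
    truth = world.get(claim.cell)  # None if the cell is empty
    if claim.obj == truth:
        return "correct"
    if claim.cell in visible:
        # False claim about something the model directly observed.
        return "perceptual_hallucination"
    # False claim about something outside the model's view.
    return "unobserved_hallucination"


# Toy reference world and view (hypothetical values).
world = {(0, 0): "key", (1, 2): "door"}
visible = {(0, 0), (0, 1)}

labels = [
    label_claim(world, visible, Claim((0, 0), "key")),
    label_claim(world, visible, Claim((0, 1), "key")),
    label_claim(world, visible, Claim((1, 2), "wall")),
    label_claim(world, visible, None),
]
```

Because the label depends only on the world, the view, and the claim, no human annotation is needed, and the world size or the visible set can be varied independently of the model under test.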

Comments:	HalluWorld benchmark (code and data) at this http URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2605.19341 [cs.CL]
	(or arXiv:2605.19341v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19341 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Steven Y. Feng
[v1] Tue, 19 May 2026 04:29:03 UTC (10,992 KB)

Originally published at arxiv.org
