Generating in the Limit with Infinitely Many Hallucinations
Quick Answer
The paper introduces a new notion of precision for language generation in the limit and recasts the tension between coverage and validity as a recall-precision trade-off.
Quick Take
The paper recasts the tension between coverage and validity in language generation in the limit as the classic recall-precision trade-off. It allows learners to make infinitely many mistakes as long as their frequency tends to zero, and shows that this relaxation can strictly increase recall when the adversary permanently withholds a large portion of the target language. The aim is a model closer to the settings large language models face, where occasional errors and repetitions are unavoidable but their rates are controlled.
Key Points
- Introduces a new notion of precision for language generation in the limit and frames the problem as a recall-precision trade-off.
- Allows infinitely many mistakes, provided their frequency tends to zero so that precision remains one.
- Shows that this relaxation can strictly increase recall when the adversary permanently withholds a large portion of the target language.
- Studies a continuous relaxation of the novelty constraint that requires only a fixed fraction of outputs to be novel.
- Moves toward a more realistic model of language generation in which errors and repetitions occur at controlled rates.
Abstract: The classic paradigm of language identification in the limit models learning as a game between an adversary, who reveals strings from an unknown target language, and a learner tasked with identifying that language. The recently introduced framework of language generation in the limit shifted the objective to better reflect modern language modeling, requiring the learner to produce valid, unseen strings from the target language. Related work highlighted a fundamental tension: a broad coverage of the target often comes at the cost of validity. We introduce a new notion of precision and recast this problem as the classic recall-precision trade-off. We analyze generation in the limit under varying constraints on enumeration, novelty, and validity, aimed at reflecting settings closer to those encountered by large language models. A key contribution is our analysis of learners that are not eventually valid: we allow infinitely many mistakes, provided their frequency tends to zero so that precision remains one. We show that this relaxation can strictly increase recall when the adversary permanently withholds a large portion of the target language. We also study a continuous relaxation of the novelty constraint that requires only a fixed fraction of outputs to be novel. Taken together, our results move toward a more realistic model of language generation where occasional errors and repetitions are unavoidable, but their rates are controlled.
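As a toy illustration of how a learner can make infinitely many mistakes while precision remains one, the sketch below assumes precision is read as the limiting fraction of valid outputs; the paper's formal definition is not reproduced on this page. The error schedule (a mistake at every perfect-square step) and the function names are hypothetical, chosen only because the schedule never stops erring yet errs ever more rarely.

```python
from math import isqrt


def is_mistake_step(t: int) -> bool:
    """Hypothetical error schedule: the learner errs exactly at perfect-square steps."""
    return isqrt(t) ** 2 == t


def running_precision(num_steps: int) -> float:
    """Fraction of valid outputs among the first num_steps outputs (steps 1..num_steps)."""
    mistakes = sum(1 for t in range(1, num_steps + 1) if is_mistake_step(t))
    return 1 - mistakes / num_steps


# Mistakes never stop (there are infinitely many perfect squares), but only
# about sqrt(n) of the first n outputs are wrong, so the valid fraction climbs toward 1.
for n in (100, 10_000, 1_000_000):
    print(n, running_precision(n))
```

Under this schedule the learner is never "eventually valid", since a mistake occurs at arbitrarily late steps, yet the running fraction of valid outputs approaches one as the horizon grows.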
Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Cite as: arXiv:2606.28354 [cs.CL] (or arXiv:2606.28354v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.28354 (arXiv-issued DOI via DataCite)
Submission history
From: Irene Strauss
[v1] Mon, 8 Jun 2026 09:58:13 UTC (60 KB)
Originally published at arxiv.org