Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

arXiv cs.CL·Sil Hamilton, David Mimno

·~1 min·5/27/2026·en·1

Quick Take

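A study of 20,000 LLM-generated stories finds very low diversity: 11 words occur in 88.3% of them, with little difference between the four models sampled. These words are rare in published literature and pre-training data but do appear in preference data, where such 'lighthouse' stories are still infrequent compared with the average post-training story, much of which references copyrighted characters or adult content. The authors take this as evidence that small datasets combined with powerful alignment algorithms can have a disproportionate effect on model output.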

Key Points

11 words occur in 88.3% of the generated stories, with little difference between models.
The study sampled 20,000 stories from four current models using five prompts.
The words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian).
These tokens are rare in published literature and pre-training data but appear in preference data likely used by all current models.
The findings point to a potentially disproportionate impact of small datasets combined with powerful alignment algorithms.

Article Excerpt

From source RSS / original summary

arXiv:2605.26492v1 Announce Type: new Abstract: LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian).

These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.
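
To make the headline statistic concrete, here is a minimal Python sketch of one way a figure like "11 words occur in 88.3% of stories" could be computed. It assumes the statistic means the share of stories containing at least one word from a fixed list, which the abstract does not spell out. The word list holds only the six examples named in the abstract (the full list of 11 is not given here), the function names are hypothetical, and the toy stories are invented for illustration rather than taken from the paper.

```python
import re

# Only the six example words named in the abstract; the paper's
# full list of 11 words is not reproduced in this summary.
MARKER_WORDS = {"elias", "mara", "elara", "lighthouse", "clockmaker", "librarian"}


def tokenize(text):
    """Lowercase a story and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def share_with_any_marker(stories, markers=MARKER_WORDS):
    """Return the fraction of stories containing at least one marker word.

    Matching is exact on lowercase tokens, with no stemming, so
    "lighthouses" does not count as a match for "lighthouse".
    """
    if not stories:
        return 0.0
    hits = sum(1 for story in stories if tokenize(story) & markers)
    return hits / len(stories)


# Invented toy stories, for illustration only (not data from the paper).
toy_stories = [
    "Elias kept the lighthouse lit through the storm.",
    "The librarian found a map between two old books.",
    "A fox crossed the frozen river at dawn.",
    "Mara repaired the village clock before sunrise.",
]

print(share_with_any_marker(toy_stories))  # 3 of 4 toy stories match: 0.75
```

Counting each story once, however many marker words it contains, keeps the measure at the story level. A real analysis would also need to decide how to handle plurals and possessives, which this sketch deliberately leaves out.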

Read on arxiv.org
