Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories
Quick Take
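A study of 20,000 LLM-generated stories finds very low diversity: 11 words occur in 88.3% of them, with little difference between models. The authors trace these words to preference data rather than to published literature or pre-training data, which suggests that small datasets combined with powerful alignment algorithms can have a disproportionate effect on model output. Notably, these 'lighthouse' stories are infrequent compared with the average post-training story, much of which references copyrighted characters or contains adult content.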
Key Points
- 11 words occur in 88.3% of the LLM-generated stories, with little difference between models.
- The study sampled stories from four current models using five prompts.
- Common words include names like Elias and settings like lighthouses.
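- These tokens are rare in published literature and pre-training data, but they do appear in preference data that all current models are likely to have used.
- The findings point to the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.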
Article Excerpt
From source RSS / original summary
arXiv:2605.26492v1
Abstract: LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian).
These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.
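The headline statistic can be read as the share of stories that contain at least one word from a fixed list. The sketch below illustrates that reading and is not the paper's code. The function names, the tokenization, the plural handling, and the "at least one word" interpretation are all assumptions, and the word list is deliberately partial because the abstract names only six of the 11 words.

```python
import re

# Partial list: the abstract names only these six of the 11 words.
MARKER_WORDS = {"elias", "mara", "elara", "lighthouse", "clockmaker", "librarian"}


def contains_marker(story: str) -> bool:
    """Return True if the story contains any marker word as a whole token.

    Assumption: a token also matches if dropping one trailing "s" yields a
    marker word, so "lighthouses" counts as "lighthouse".
    """
    for tok in re.findall(r"[a-z]+", story.lower()):
        if tok in MARKER_WORDS:
            return True
        if tok.endswith("s") and tok[:-1] in MARKER_WORDS:
            return True
    return False


def marker_share(stories) -> float:
    """Fraction of stories that contain at least one marker word."""
    stories = list(stories)
    if not stories:
        return 0.0
    return sum(contains_marker(s) for s in stories) / len(stories)


# Hypothetical toy corpus, invented purely for illustration.
sample = [
    "Elias climbed the lighthouse stairs.",
    "The clockmaker's shop was quiet.",
    "A dragon slept under the hill.",
    "Two lighthouses faced the sea.",
]
share = marker_share(sample)
print(f"{share:.1%} of stories contain a marker word")
```

Matching on whole tokens rather than substrings keeps a name such as "Amara" from counting as "Mara". A real replication would need the full 11-word list and the paper's own matching rules.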