When transformers learn "impossible" languages, what do they learn?
Quick Answer
GPT-2 style models trained on "impossible" variants of English lose grammatical sensitivity only gradually, but they fail markedly at generation, producing substantially fewer high-quality sentences as sentence length increases.
Quick Take
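The paper tests two candidate explanations for why "impossible" languages are unattested in humans: deficient grammatical sensitivity and deficient generative production. On BLiMP minimal pairs, GPT-2 style models trained on perturbed English degrade only gradually, and the degradation is mediated by the language's information locality. In generation, the same models fail markedly, producing substantially fewer high-quality sentences at longer lengths. The authors therefore propose generative deficiency and transmission failure, rather than weak grammatical sensitivity, as a plausible link between model behaviour and non-attestation.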
Key Points
- GPT-2 style models trained on perturbed "impossible" variants of English show only gradual degradation in grammatical sensitivity, as measured on BLiMP minimal pairs.
- The degree of degradation is mediated by the language's information locality.
- In generation, the models fail markedly, producing substantially fewer high-quality sentences at longer lengths.
- The results point to generative deficiency and transmission failure as a plausible explanation for the non-attestation of impossible languages.
Article Content
From source RSS / original summary
arXiv:2606.30815v1
Abstract: Recent work suggests that transformer language models show a bias towards human languages over unnatural ("impossible") languages argued to be unacquirable by humans. However, this literature has largely based these claims on differences in sample efficiency and test-set perplexity, rather than on direct evaluations of the linguistic capacities that could plausibly explain non-attestation in human languages.
We evaluate two theoretically motivated linking hypotheses: impossibility arising from deficiencies in grammatical sensitivity or generative production. Using GPT-2 style models trained on perturbed "impossible" variants of English, we measure sensitivity to grammaticality using BLiMP minimal pairs, finding that model performance exhibits only gradual degradation, mediated by the language's information locality.
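To make the minimal-pair measurement concrete, here is a minimal sketch of how accuracy on BLiMP-style pairs is typically computed: a pair counts as correct when the model assigns a higher log-probability to the grammatical sentence than to its ungrammatical twin. The summary does not describe the paper's evaluation code, so everything below is an assumption for illustration: the names `train_bigram_scorer` and `minimal_pair_accuracy` are hypothetical, and an add-one smoothed bigram model on a four-sentence toy corpus stands in for a trained GPT-2 style model.

```python
import math
from collections import Counter


def train_bigram_scorer(corpus):
    """Return a log-probability function from an add-one smoothed bigram model.

    Toy stand-in for a trained language model (an assumption of this sketch).
    """
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size))
            for prev, cur in zip(tokens, tokens[1:])
        )

    return log_prob


def minimal_pair_accuracy(pairs, log_prob):
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives a strictly higher log-probability."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


# Toy training corpus and subject-verb agreement pairs (illustrative data only).
corpus = ["the cats sleep", "the cat sleeps", "the dogs bark", "the dog barks"]
pairs = [
    ("the cats sleep", "the cats sleeps"),
    ("the dog barks", "the dog bark"),
]

scorer = train_bigram_scorer(corpus)
accuracy = minimal_pair_accuracy(pairs, scorer)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

In a real evaluation the scorer would presumably sum token log-probabilities from the trained model, with each pair perturbed in the same way as the model's training language.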
In contrast, these models exhibited pronounced failures in generation, producing substantially fewer high-quality sentences at longer lengths. Together, these results suggest generative deficiency and transmission failures as a plausible linking hypothesis between language model behaviour and non-attestation of impossible languages.
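The generation result can be read as a rate per sentence length: of the sentences a model produces at a given length, what share is high quality? The sketch below shows one way to tabulate that. The summary does not say how the paper judges quality, so the `is_high_quality` predicate here (ends with a period, no immediately repeated word) is a placeholder assumption, as are the function names, the bucket size, and the sample sentences.

```python
from collections import defaultdict


def is_high_quality(sentence):
    """Placeholder quality check (an assumption, not the paper's criterion):
    the sentence ends with a period and never repeats a word back to back."""
    tokens = sentence.split()
    if not tokens or tokens[-1] != ".":
        return False
    return all(a != b for a, b in zip(tokens, tokens[1:]))


def quality_rate_by_length(sentences, quality_fn, bucket_size=5):
    """Map each length bucket (keyed by its lower bound in tokens) to the
    share of sentences in that bucket that pass quality_fn."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for sentence in sentences:
        length = len(sentence.split())
        bucket = (length // bucket_size) * bucket_size
        totals[bucket] += 1
        if quality_fn(sentence):
            passed[bucket] += 1
    return {bucket: passed[bucket] / totals[bucket] for bucket in sorted(totals)}


# Hand-written stand-ins for model samples (illustrative data only).
samples = [
    "the cat sleeps .",
    "the dog barks .",
    "the the cat sleeps .",
    "the cat sleeps and the dog barks .",
    "the cat sleeps and and the dog barks",
    "the dog barks and the cat cat sleeps .",
]

rates = quality_rate_by_length(samples, is_high_quality)
for bucket, rate in rates.items():
    print(f"{bucket}-{bucket + 4} tokens: {rate:.2f} high quality")
```

A falling rate across buckets is the pattern the abstract describes; comparing those curves between the natural and the perturbed languages would show how much of the drop is specific to the "impossible" variants.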