A Modular Architecture for Typologically Controlled Lexicon Generation
Quick Take
A new modular framework for lexicon generation draws on PHOIBLE phoneme inventories and a Swadesh-Leipzig-Jakarta ontology. In evaluation, probabilistic grammars outperform deterministic and random baselines on phonotactic coherence and typological realism across lexicon sizes of 100 to 5,000 forms.
Key Points
- Proposed framework samples phoneme inventories from PHOIBLE for lexicon generation.
- Generates word forms under interchangeable phonological grammars (deterministic, OT, and MaxEnt).
- Evaluation metrics are character n-gram perplexity, log-likelihood, and KL divergence against PHOIBLE.
- Probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
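To make the MaxEnt option concrete, here is a minimal sketch of sampling word forms from a maximum-entropy phonotactic grammar, where each form's probability is proportional to exp(-sum of weighted constraint violations). The inventory, the three constraints (*CC, *VV, NoCoda), their weights, and all function names are hypothetical choices for illustration. They are not taken from the paper, which samples real inventories from PHOIBLE.

```python
import itertools
import math
import random

# Hypothetical toy inventory (assumption); the paper samples inventories from PHOIBLE.
CONSONANTS = ["p", "t", "k", "m", "n", "s"]
VOWELS = ["a", "i", "u"]

# Illustrative constraint weights for (*CC, *VV, NoCoda); assumed, not from the paper.
WEIGHTS = (3.0, 2.0, 1.0)


def violations(form):
    """Count violations of the three illustrative constraints for one form."""
    is_v = [seg in VOWELS for seg in form]
    pairs = list(zip(is_v, is_v[1:]))
    cc = sum(1 for a, b in pairs if not a and not b)  # adjacent consonants
    vv = sum(1 for a, b in pairs if a and b)          # adjacent vowels
    coda = 0 if is_v[-1] else 1                       # word-final consonant
    return (cc, vv, coda)


def harmony(form):
    """Weighted violation total; lower means better formed."""
    return sum(w * v for w, v in zip(WEIGHTS, violations(form)))


def maxent_distribution(length):
    """Enumerate all forms of a fixed length with P(form) proportional to exp(-harmony)."""
    segments = CONSONANTS + VOWELS
    candidates = ["".join(p) for p in itertools.product(segments, repeat=length)]
    scores = [math.exp(-harmony(c)) for c in candidates]
    z = sum(scores)  # partition function over the candidate set
    return candidates, [s / z for s in scores]


def sample_lexicon(size, length=4, seed=0):
    """Draw `size` forms (with replacement) from the MaxEnt distribution."""
    rng = random.Random(seed)
    candidates, probs = maxent_distribution(length)
    return rng.choices(candidates, weights=probs, k=size)


if __name__ == "__main__":
    print(sample_lexicon(10))
```

Exhaustive enumeration only works for short forms and small inventories. A deterministic grammar would instead keep only the zero-violation forms, which is one way to picture the contrast the summary draws between grammar types.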
Article Excerpt
arXiv:2605.28824v1. Abstract: Constructing artificial lexicons that are pronounceable, typologically plausible, and semantically structured remains an open challenge in computational linguistics. Existing conlang generators either lack formal phonotactic guarantees or delegate generation to opaque, non-reproducible LLM-based pipelines.
We propose a modular framework that samples phoneme inventories from PHOIBLE, generates word forms under interchangeable phonological grammars (deterministic, OT, and MaxEnt), and assigns meanings via a Swadesh-Leipzig-Jakarta ontology with explicit form-meaning alignment.
Evaluation on character n-gram perplexity, log-likelihood, and KL divergence against PHOIBLE across lexicon sizes of 100 to 5,000 forms shows that probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
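The two metric families named above can be sketched as follows. This is an assumed, simplified version: a character bigram model with add-alpha smoothing for perplexity, and a smoothed KL divergence between two frequency tables. The abstract does not specify the n-gram order, the smoothing scheme, or which PHOIBLE statistics serve as the reference distribution, so those details and all names here are illustrative.

```python
import math
from collections import Counter

BOS, EOS = "<", ">"  # word-boundary markers (assumed not to occur in the forms)


def bigram_counts(words):
    """Count character bigrams and their left contexts over padded words."""
    pair_counts, context_counts = Counter(), Counter()
    for w in words:
        padded = BOS + w + EOS
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts


def perplexity(train, test, alpha=1.0):
    """Per-character perplexity of `test` under an add-alpha bigram model of `train`."""
    pair_counts, context_counts = bigram_counts(train)
    # Symbols that can be predicted: every character seen, plus the end marker.
    vocab = {ch for w in list(train) + list(test) for ch in w} | {EOS}
    log_prob, n = 0.0, 0
    for w in test:
        padded = BOS + w + EOS
        for a, b in zip(padded, padded[1:]):
            p = (pair_counts[(a, b)] + alpha) / (context_counts[a] + alpha * len(vocab))
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


def kl_divergence(p_counts, q_counts, alpha=1e-3):
    """KL(P || Q) in nats between two count tables, smoothed over their joint support."""
    support = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(support)
    q_total = sum(q_counts.values()) + alpha * len(support)
    kl = 0.0
    for x in support:
        p = (p_counts.get(x, 0) + alpha) / p_total
        q = (q_counts.get(x, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


if __name__ == "__main__":
    train = ["pata", "tika", "muna", "sami"]  # hypothetical generated lexicon
    print(perplexity(train, ["tana"]), perplexity(train, ["ttnn"]))
    print(kl_divergence(Counter("".join(train)), Counter("patakatina")))
```

Lower perplexity indicates that held-out forms follow the sound patterns of the lexicon, while a smaller KL divergence indicates segment frequencies closer to the reference distribution.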