A Modular Architecture for Typologically Controlled Lexicon Generation

arXiv cs.CL·Sankalp Tattwadarshi Swain, Dhruv Kumar

5/29/2026

Quick Take

A new modular framework for lexicon generation samples phoneme inventories from PHOIBLE, generates word forms under interchangeable phonological grammars, and assigns meanings via a Swadesh-Leipzig-Jakarta ontology. In evaluation across lexicon sizes of 100 to 5,000 forms, the probabilistic grammars consistently outperform deterministic and random baselines on phonotactic coherence and typological realism.

Key Points

The framework samples phoneme inventories from PHOIBLE as the basis for lexicon generation.
Word forms are generated under interchangeable phonological grammars: deterministic, OT (Optimality Theory), and MaxEnt (maximum entropy).
Meanings are assigned via a Swadesh-Leipzig-Jakarta ontology with explicit form-meaning alignment.
Evaluation metrics are character n-gram perplexity, log-likelihood, and KL divergence against PHOIBLE.
Probabilistic grammars consistently outperform deterministic and random baselines on phonotactic coherence and typological realism.

Article Excerpt

From source RSS / original summary

arXiv:2605.28824v1. Abstract: Constructing artificial lexicons that are pronounceable, typologically plausible, and semantically structured remains an open challenge in computational linguistics. Existing conlang generators either lack formal phonotactic guarantees or delegate generation to opaque, non-reproducible LLM-based pipelines.

We propose a modular framework that samples phoneme inventories from PHOIBLE, generates word forms under interchangeable phonological grammars (deterministic, OT, and MaxEnt), and assigns meanings via a Swadesh-Leipzig-Jakarta ontology with explicit form-meaning alignment.
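
To make the MaxEnt option concrete, here is a minimal Python sketch of how a maximum-entropy grammar can turn weighted constraint violations into a sampling distribution over candidate word forms. It is an illustration under stated assumptions, not the paper's implementation: the toy inventory, the four constraints, their weights, and every function name are hypothetical, and the inventory is hard-coded rather than sampled from PHOIBLE.

```python
import itertools
import math
import random

# Hypothetical toy inventory; the paper samples inventories from PHOIBLE instead.
CONSONANTS = ["p", "t", "k", "m", "n", "s"]
VOWELS = ["a", "i", "u"]


def candidates(max_len=4):
    """Enumerate every phoneme string of length 1..max_len over the toy inventory."""
    segments = CONSONANTS + VOWELS
    for n in range(1, max_len + 1):
        for combo in itertools.product(segments, repeat=n):
            yield "".join(combo)


def is_vowel(ch):
    return ch in VOWELS


# Illustrative markedness constraints: each returns a violation count for a form.
def no_cluster(form):
    return sum(1 for a, b in zip(form, form[1:]) if not is_vowel(a) and not is_vowel(b))


def no_hiatus(form):
    return sum(1 for a, b in zip(form, form[1:]) if is_vowel(a) and is_vowel(b))


def no_coda(form):
    return 0 if is_vowel(form[-1]) else 1


def onset(form):
    return 1 if is_vowel(form[0]) else 0


# (constraint, weight) pairs; the weights are made up for illustration.
CONSTRAINTS = [(no_cluster, 3.0), (no_hiatus, 2.0), (no_coda, 1.0), (onset, 0.5)]


def harmony(form):
    """Weighted sum of constraint violations (higher means less well-formed)."""
    return sum(weight * constraint(form) for constraint, weight in CONSTRAINTS)


def maxent_distribution(forms):
    """P(form) is proportional to exp(-harmony(form)), normalized over the candidates."""
    scores = [math.exp(-harmony(f)) for f in forms]
    z = sum(scores)
    return [s / z for s in scores]


def sample_lexicon(size, seed=0):
    """Draw `size` forms (with replacement) from the MaxEnt distribution."""
    rng = random.Random(seed)
    forms = list(candidates())
    probs = maxent_distribution(forms)
    return rng.choices(forms, weights=probs, k=size)


if __name__ == "__main__":
    print(sample_lexicon(10))
```

Because sampling is with replacement, a real lexicon builder would also need to deduplicate forms before assigning meanings. Swapping `maxent_distribution` for a function that returns only the lowest-harmony candidates would give a deterministic variant over the same constraints, which is one way to read "interchangeable" grammars here.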

Evaluation on character n-gram perplexity, log-likelihood, and KL divergence against PHOIBLE across lexicon sizes of 100 to 5,000 forms shows that probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
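
The metrics named in the abstract can be sketched as follows. This is a rough Python illustration, not the paper's evaluation code: it assumes a character bigram model with add-one smoothing (the abstract does not state the n-gram order or smoothing), and it computes KL divergence between two segment-frequency distributions, whereas the paper's exact comparison against PHOIBLE is not described in the excerpt. All function names are hypothetical.

```python
import math
from collections import Counter

# Multi-character boundary markers cannot collide with single-character segments.
BOS, EOS = "<s>", "</s>"


def _symbols(form):
    return [BOS] + list(form) + [EOS]


def bigram_perplexity(train_forms, test_forms):
    """Per-symbol perplexity of test_forms under an add-one-smoothed character bigram model."""
    context, bigram = Counter(), Counter()
    for form in train_forms:
        syms = _symbols(form)
        for a, b in zip(syms, syms[1:]):
            context[a] += 1
            bigram[(a, b)] += 1

    # Vocabulary of predictable symbols: every character seen anywhere, plus EOS.
    vocab = {EOS}
    for form in list(train_forms) + list(test_forms):
        vocab.update(form)
    v = len(vocab)

    log_lik, n = 0.0, 0
    for form in test_forms:
        syms = _symbols(form)
        for a, b in zip(syms, syms[1:]):
            log_lik += math.log((bigram[(a, b)] + 1) / (context[a] + v))
            n += 1
    return math.exp(-log_lik / n)


def unigram_distribution(forms):
    """Relative frequency of each character across a list of forms."""
    counts = Counter(ch for form in forms for ch in form)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats; q is floored at eps (a crude fix, q is not renormalized)."""
    return sum(
        pv * math.log(pv / max(q.get(k, 0.0), eps))
        for k, pv in p.items()
        if pv > 0
    )


if __name__ == "__main__":
    reference = ["pata", "kami", "tuna", "sanu"]
    print(bigram_perplexity(reference, ["pami", "tana"]))
    print(bigram_perplexity(reference, ["ptkm", "aiua"]))
    print(kl_divergence(unigram_distribution(["pami", "tana"]),
                        unigram_distribution(reference)))
```

Lower perplexity means the test forms follow the reference's character-sequence patterns more closely, and lower KL divergence means the two segment-frequency profiles are closer. The log-likelihood metric is the `log_lik` sum computed inside `bigram_perplexity` before normalization.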
