Discovering Lexical Gaps Using Embeddings from Multilingual LLMs

arXiv cs.CL·Yoonwon Jung, Aaron S. Cohen, Benjamin K. Bergen

5/26/2026

Quick Take

A new data-driven framework identifies cross-lingual lexical gaps using embeddings from Korean-English bilingual LLMs. Gap words show weaker cross-lingual semantic alignment than non-gap words, and logistic classifiers separate the two with AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean), giving a scalable, taxonomy-free approach to lexical gap detection.

Key Points

Extracted embeddings from Korean-English bilingual LLMs for translation pairs.
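Gap words showed weaker cross-lingual semantic alignment than non-gap words in 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces.
Logistic classifiers achieved AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) for gap detection.
Retrieved 18 of 19 Korean and 26 of 27 English gap words.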
Provides a language-agnostic method for scalable lexical gap identification.

Article Content

From source RSS / original summary

arXiv:2605.24310v1. Abstract: Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps.

We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared the distributions of these similarities for gap words versus non-gap words.
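The nearest-neighbor step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes cosine similarity as the similarity measure (the abstract does not name one), and the vectors, the word labels `word_a` and `word_b`, and the helper `nearest_neighbor_similarity` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_similarity(source_vec, target_vecs):
    """Return (target word, similarity) for the closest target-language word."""
    best_word = max(target_vecs, key=lambda w: cosine(source_vec, target_vecs[w]))
    return best_word, cosine(source_vec, target_vecs[best_word])

# Toy 3-dimensional vectors, invented for illustration only (assumption:
# source and target embeddings already live in one shared space).
target_space = {
    "rice": [1.0, 0.0, 0.0],
    "water": [0.0, 1.0, 0.0],
}
source_space = {
    "word_a": [0.9, 0.1, 0.0],  # lies close to one target vector
    "word_b": [0.4, 0.5, 0.7],  # has no close target neighbor
}

scores = {
    word: nearest_neighbor_similarity(vec, target_space)
    for word, vec in source_space.items()
}
for word, (neighbor, sim) in scores.items():
    print(f"{word}: nearest={neighbor}, similarity={sim:.3f}")
```

In this toy setup a word with no close target-language neighbor, like `word_b`, receives a lower score, which is the kind of signal the abstract reports for gap words.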

In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words showed weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces can reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.
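To make the classification and AUC step concrete, here is a small sketch under strong assumptions: the classifier uses a single feature (the nearest-neighbor similarity), is fit with plain gradient descent, and runs on invented toy numbers. The abstract does not specify the features or the training setup, so `fit_logistic`, `auc`, `sims`, and `is_gap` are hypothetical names and data, not the paper's.

```python
import math

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(gap | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data: nearest-neighbor similarities and gap labels (1 = gap word).
sims = [0.35, 0.42, 0.55, 0.61, 0.70, 0.78, 0.85, 0.90]
is_gap = [1, 1, 1, 0, 1, 0, 0, 0]

w, b = fit_logistic(sims, is_gap)
probs = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in sims]
toy_auc = auc(probs, is_gap)
print(f"slope={w:.2f}, AUC={toy_auc:.4f}")
```

A negative fitted slope means lower similarity raises the predicted probability of a gap. The toy AUC has no relation to the 0.81 and 0.76 reported in the abstract; it only shows how such a number is computed.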


