GlossAssist -- A Tool to Simplify Corpus Creation and Study the Effect of NLP Models in Low-Resource Documentation Settings
Quick Take
GlossAssist is a glossing tool built on the retrieval-based CWoMP (Contrastive Word-Morpheme Pre-training) architecture to support interlinear glossed text (IGT) production in low-resource settings. It treats each annotator correction as active-learning feedback that expands its lexicon, so predictions improve without retraining the model. The goal is to ease the slow and costly manual glossing that language documentation traditionally requires.
Key Points
- GlossAssist automates IGT production, aiming to reduce the time and cost of manual glossing for linguists.
- The tool uses a mutable lexicon of learned morpheme representations.
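- Each annotator correction expands the lexicon, improving future predictions without retraining the model.
- Unlike existing tools built to be evaluated rather than used, GlossAssist gives linguists an interpretable path for feeding corrections and expertise back into model behavior.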
- The system is designed specifically for documentary linguists.
Article Excerpt
From source RSS / original summary (arXiv:2606.04367v1). Abstract: Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation. Producing it manually, however, is often slow and costly. Automated glossing systems have improved substantially in recent years, but adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used, offering no interpretable path for correction or the incorporation of linguistic expertise back into model behavior.
We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. In conjunction with CWoMP, our system treats each correction by an annotator as part of an active learning setting, which expands the lexicon and improves future predictions without having to retrain the model.
In this paper, we present our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.
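The correction loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `MutableLexicon`, the functions `embed` and `cosine`, and the toy forms and glosses are all hypothetical. A character-bigram bag also stands in for CWoMP's learned morpheme representations, which the summary does not specify. The sketch only shows the general idea that a retrieval-based predictor can change its output when an entry is added to the lexicon, with no training step involved.

```python
import math
from collections import Counter


def embed(text, n=2):
    """Toy stand-in for a learned encoder: a bag of character n-grams."""
    padded = f"#{text}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MutableLexicon:
    """Hypothetical lexicon mapping each form to (vector, gloss)."""

    def __init__(self):
        self.entries = {}

    def add(self, form, gloss):
        self.entries[form] = (embed(form), gloss)

    def predict(self, form):
        """Return the gloss of the most similar entry and its score."""
        if not self.entries:
            return None, 0.0
        query = embed(form)
        scored = [(cosine(query, vec), gloss) for vec, gloss in self.entries.values()]
        score, gloss = max(scored, key=lambda pair: pair[0])
        return gloss, score

    def correct(self, form, gloss):
        """Record an annotator correction. Assumption: a correction adds or
        overwrites the entry for that form; no model weights are touched."""
        self.add(form, gloss)


# Invented toy data, not drawn from any real language.
lexicon = MutableLexicon()
lexicon.add("-ta", "PST")
lexicon.add("ni-", "1SG")

print(lexicon.predict("-te"))   # ('PST', 0.5): nearest entry is "-ta"
lexicon.correct("-te", "IMP")   # annotator overrides the suggestion
print(lexicon.predict("-te"))   # ('IMP', 1.0): the new entry is retrieved
```

Because prediction is a lookup over the lexicon, the correction takes effect on the next query. A real system would presumably use the trained encoder for both the query and the new entry, and would need a policy for conflicting glosses of the same form. Both are left open here.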