Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

arXiv cs.CL·Sanchit Ahuja, Terra Blevins

6/2/2026


Quick Answer

The study presents five parameter alignment strategies to mitigate catastrophic forgetting in multilingual expert language models during continual pretraining.

Quick Take

The authors link catastrophic forgetting in multilingual continual pretraining (CPT) to parameter drift and test five layer-aware parameter alignment strategies, including hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. These strategies substantially reduce forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, while post-hoc weight reversion yields the strongest translation gains. The results give deployment guidelines that pair each strategy with the tasks it serves best.

Key Points

Five layer-aware parameter alignment strategies were tested against two unregularized CPT baselines.
The benchmarks span 32 training languages from five language families, plus held-out languages, and four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation.
Layer freezing and regularization best preserved comprehension, while post-hoc reversion yielded the strongest translation gains.
Parameter alignment substantially reduces forgetting at minimal cost to language acquisition.
Results offer practical deployment guidelines for family-expert continual pretraining.

Paper Resources

Read Paper (arxiv.org) · View PDF (arxiv.org)

Article Content

From source RSS / original summary

arXiv:2606.00284v1 · Announce Type: new

Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks.

We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation.

Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition-forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
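
The abstract names its strategies without giving their formulation, so the following is only a minimal sketch of the generic, textbook form of each idea, not the authors' implementation. It treats a model as a dict mapping layer names to lists of floats. Every function name, the per-layer `strength` coefficients, the linear-interpolation form of merging, and the toy numbers are assumptions made for illustration.

```python
# Toy sketch only. "Weights" are dicts of layer name -> list of floats.
# All names, coefficients, and values are hypothetical, not from the paper.
import math


def layer_drift(base, tuned):
    """Parameter drift: L2 distance between base and tuned weights, per layer."""
    return {
        name: math.sqrt(sum((t - b) ** 2 for b, t in zip(base[name], tuned[name])))
        for name in base
    }


def sgd_step_with_freezing(weights, grads, frozen, lr=0.1):
    """Hard layer freezing: layers listed in `frozen` receive no update."""
    return {
        name: list(w) if name in frozen
        else [wi - lr * gi for wi, gi in zip(w, grads[name])]
        for name, w in weights.items()
    }


def l2_to_base_penalty(base, weights, strength):
    """Soft regularization: squared distance from the base model.
    `strength` maps layer name -> coefficient, so the pull can vary by layer."""
    return sum(
        strength[name] * sum((w - b) ** 2 for b, w in zip(base[name], weights[name]))
        for name in base
    )


def revert_layers(base, tuned, layers):
    """Post-hoc weight reversion: copy base weights back into chosen layers."""
    return {
        name: list(base[name]) if name in layers else list(tuned[name])
        for name in tuned
    }


def merge(base, tuned, alpha):
    """Model merging as linear interpolation: alpha=0 is base, alpha=1 is tuned."""
    return {
        name: [(1 - alpha) * b + alpha * t for b, t in zip(base[name], tuned[name])]
        for name in base
    }


# Illustrative values only.
base = {"layer0": [1.0, 2.0], "layer1": [0.0, 0.0]}
grads = {"layer0": [1.0, 1.0], "layer1": [-3.0, 4.0]}

tuned = sgd_step_with_freezing(base, grads, frozen=set(), lr=1.0)  # unregularized step
frozen_run = sgd_step_with_freezing(base, grads, frozen={"layer0"}, lr=1.0)

drift = layer_drift(base, tuned)
penalty = l2_to_base_penalty(base, tuned, {"layer0": 1.0, "layer1": 0.5})
reverted = revert_layers(base, tuned, {"layer1"})
merged = merge(base, tuned, 0.5)
```

In this sketch, freezing and the penalty act during training, while reversion and merging are applied to a finished checkpoint. Per-layer drift is one plausible signal for choosing which layers to freeze or revert, though the abstract does not say how the authors select layers.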
