Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models
Quick Take
The study presents five layer-aware parameter alignment strategies to mitigate catastrophic forgetting in multilingual expert language models during continual pretraining (CPT). These strategies, which include hard layer freezing, soft regularization, post-hoc weight reversion, and model merging, substantially reduce forgetting at minimal cost to language acquisition. Layer freezing and regularization best preserve comprehension, while post-hoc weight reversion yields the strongest translation gains. The results offer deployment guidelines that pair each strategy with the tasks it serves best.
Key Points
- Five layer-aware parameter alignment strategies were tested against two unregularized CPT baselines.
- The study covers 32 training languages from five language families, plus held-out languages, and evaluates on four axes: perplexity, reading comprehension, physical reasoning, and translation.
- Layer freezing and regularization best preserved comprehension, while post-hoc reversion improved translation.
- Parameter alignment strategies substantially reduce forgetting at minimal cost to language acquisition.
- Results offer practical deployment guidelines for family-expert continual pretraining.
Article Content
From source RSS / original summary: arXiv:2606.00284v1. Announce Type: new. Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks.
We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation.
Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition-forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
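The abstract names the alignment strategies but does not give their formulas. As a rough illustration only, the sketch below shows generic textbook forms of parameter drift, hard layer freezing, soft regularization, post-hoc weight reversion, and linear model merging on toy weights. Everything here is an assumption rather than the paper's method: the function names, the layer names, the L2 form of the drift and penalty, and the `strength` and `alpha` parameters are all hypothetical.

```python
from math import sqrt

# Toy "model": layer name -> flat list of weights (hypothetical layout).
Params = dict[str, list[float]]


def layer_drift(base: Params, tuned: Params) -> dict[str, float]:
    """Per-layer L2 distance between base and continually pretrained weights."""
    return {
        name: sqrt(sum((t - b) ** 2 for b, t in zip(base[name], tuned[name])))
        for name in base
    }


def frozen_sgd_step(params: Params, grads: Params, lr: float,
                    frozen: set[str]) -> Params:
    """Hard layer freezing: layers listed in `frozen` skip the SGD update."""
    return {
        name: list(w) if name in frozen
        else [wi - lr * gi for wi, gi in zip(w, grads[name])]
        for name, w in params.items()
    }


def l2_anchor_penalty(base: Params, params: Params, strength: float) -> float:
    """Soft regularization: strength * sum of squared distances to the base
    weights, to be added to the training loss."""
    return strength * sum(
        (p - b) ** 2
        for name in base
        for b, p in zip(base[name], params[name])
    )


def revert_layers(base: Params, tuned: Params, layers: set[str]) -> Params:
    """Post-hoc reversion: after training, copy base weights back into the
    chosen layers and keep the tuned weights everywhere else."""
    return {
        name: list(base[name]) if name in layers else list(w)
        for name, w in tuned.items()
    }


def merge(base: Params, tuned: Params, alpha: float) -> Params:
    """Linear merging: (1 - alpha) * base + alpha * tuned, per weight."""
    return {
        name: [(1 - alpha) * b + alpha * t
               for b, t in zip(base[name], tuned[name])]
        for name in base
    }


if __name__ == "__main__":
    base = {"embed": [0.0, 0.0], "top": [1.0, 1.0]}
    tuned = {"embed": [3.0, 4.0], "top": [1.0, 2.0]}
    print(layer_drift(base, tuned))             # which layers moved most
    print(revert_layers(base, tuned, {"embed"}))  # undo drift in one layer
    print(merge(base, tuned, alpha=0.5))          # midpoint of the two models
```

In this framing, freezing and the anchor penalty act during training, while reversion and merging are applied to the finished checkpoint. Which layers to freeze or revert, and how strongly to regularize, are exactly the choices the paper evaluates; the sketch does not reproduce them.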