Speculative Decoding Across Languages
Quick Take
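Speculative decoding speeds up LLM inference but is far less effective for non-English generation, because small draft models tend to have disproportionately weak multilingual capabilities. This study compares three strategies for improving efficiency across eleven languages. It finds that task-specific fine-tuning of the draft model improves efficiency significantly but generalizes poorly to a held-out task, while n-gram draft models deliver consistent speed-ups despite lower acceptance rates because drafting is much cheaper.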
Key Points
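- Three strategies evaluated: fine-tuning the draft model on task-specific (translation) data, fine-tuning it on unlabeled monolingual corpora, and training n-gram draft models on the same corpora.
- Task-specific fine-tuning significantly improves efficiency on translation but generalizes poorly to a new task.
- N-gram draft models consistently provide large speed-ups despite lower acceptance rates, because draft generation is much faster.
- The study covers eleven languages, evaluated on translation from English into the target language and on story generation as a held-out task.
- Without such adaptation, speculative decoding is far less effective for non-English generation, because small draft models have weak multilingual capabilities.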
Article Excerpt
arXiv:2605.30580v1. Abstract: Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disproportionately poor multilingual capabilities. Thus, when generating text in a non-English language, speculative decoding is far less effective.
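The draft-then-verify loop described above can be illustrated with a minimal greedy sketch. Everything here is an assumption made for illustration: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are lookup functions over a fixed sentence, and verification is simulated with a loop, whereas a real system would score all drafted positions in one parallel forward pass of the target model.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[str]], str]


def speculative_step(
    context: List[str],
    draft_next: NextToken,
    target_next: NextToken,
    k: int = 4,
) -> List[str]:
    """Run one greedy draft-and-verify step and return the tokens to append."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # Verify phase: compare each drafted token with the target's own choice.
    # (Simulated sequentially here; real systems batch this into one pass.)
    accepted: List[str] = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # Every draft was accepted, so the target also contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy stand-ins for the two models (hypothetical, for illustration only).
SENTENCE = ["the", "cat", "sat", "on", "the", "mat", "."]


def toy_target(context: Sequence[str]) -> str:
    return SENTENCE[len(context) % len(SENTENCE)]


def toy_draft(context: Sequence[str]) -> str:
    # Agrees with the target everywhere except at position 3.
    i = len(context) % len(SENTENCE)
    return "in" if i == 3 else SENTENCE[i]


if __name__ == "__main__":
    print(speculative_step([], toy_draft, toy_target, k=4))
```

In this sketch the output always matches what the target would have produced on its own; the draft only changes how many tokens each step yields. The more often the draft disagrees with the target, as the abstract reports for non-English text, the fewer tokens each verification step accepts.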
We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task-specific data (translation); finetuning the draft model on unlabeled monolingual corpora; and training simple n-gram draft models on the same monolingual corpora. We evaluate efficiency on translation (from English into the target language) and the held-out task of story generation.
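As a rough illustration of the third strategy, the sketch below trains a bigram draft model on a tiny monolingual corpus and uses it to propose tokens. The abstract does not state the n-gram order, tokenization, or smoothing used, so the bigram order, whitespace-style tokens, and the names `train_bigram` and `draft` are all assumptions made here for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence


def train_bigram(corpus: Sequence[Sequence[str]]) -> Dict[str, str]:
    """Map each token to its most frequent successor in the corpus."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    # Keeping only the top continuation makes drafting a single dict lookup.
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}


def draft(table: Dict[str, str], context: Sequence[str], k: int) -> List[str]:
    """Propose up to k tokens by repeatedly following the bigram table."""
    proposal: List[str] = []
    prev: Optional[str] = context[-1] if context else None
    for _ in range(k):
        nxt = table.get(prev)
        if nxt is None:
            break  # unseen history: stop drafting early
        proposal.append(nxt)
        prev = nxt
    return proposal


if __name__ == "__main__":
    # Toy German corpus (assumed data, not from the paper).
    corpus = [
        ["der", "hund", "schläft"],
        ["der", "hund", "bellt"],
        ["der", "hund", "schläft", "nicht"],
    ]
    table = train_bigram(corpus)
    print(draft(table, ["der"], k=3))
```

Drafting here involves no neural forward pass at all, which is the property the abstract credits for the n-gram models' speed-ups.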
We find that while task-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task. Meanwhile, n-gram draft models, despite lower acceptance rates, consistently provide large speed-ups due to much faster draft generation.
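Why a draft model with a lower acceptance rate can still be faster overall is easier to see with a small calculation. The sketch below uses a common simplified cost model, not the paper's measurements: each drafted token is accepted independently with probability `alpha`, `k` tokens are drafted per step, and one draft-model call costs a fraction `c` of one target-model call. The function name and all numbers are illustrative assumptions.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speed-up over plain decoding under the simplified model.

    alpha: per-token acceptance probability (0 <= alpha < 1)
    k:     number of tokens drafted per step
    c:     cost of one draft call relative to one target call
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per step, counting the target's own token.
    tokens_per_step = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Cost per step in units of target calls: k draft calls plus one verify pass.
    cost_per_step = k * c + 1
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Made-up settings: a neural draft that is accurate but costly,
    # and an n-gram draft that is less accurate but nearly free.
    neural = expected_speedup(alpha=0.7, k=4, c=0.2)
    ngram = expected_speedup(alpha=0.5, k=4, c=0.001)
    print(f"neural draft: {neural:.2f}x, n-gram draft: {ngram:.2f}x")
```

With these made-up numbers the n-gram setting comes out ahead even though its acceptance rate is lower, because its drafting cost is close to zero. The actual trade-off depends on the measured acceptance rates and costs for each language and task.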