Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding
Quick Answer
Dustin introduces a sparse verification framework for long-context speculative decoding, achieving a 27.85x speedup in self-attention and a 9.17x end-to-end decoding speedup on Qwen2.5-72B at 32k sequence length, with minimal accuracy loss.
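The two speedup figures can be related with a back-of-the-envelope check. The sketch below assumes a plain Amdahl's-law model in which only self-attention is accelerated and everything else in a decoding step is unchanged; the source does not state this breakdown, so the resulting share is an inference, not a reported number. The function name is hypothetical.

```python
def implied_attention_fraction(attn_speedup: float, e2e_speedup: float) -> float:
    """Solve Amdahl's law, 1/e2e = (1 - f) + f/attn, for the fraction f.

    f is the share of baseline step time spent in the accelerated part
    (assumed here to be self-attention only).
    """
    return (1.0 - 1.0 / e2e_speedup) / (1.0 - 1.0 / attn_speedup)


# Figures quoted in the summary above.
f = implied_attention_fraction(attn_speedup=27.85, e2e_speedup=9.17)
print(f"implied attention share of baseline step time: {f:.1%}")
```

Under that assumption, self-attention would account for roughly 92% of the baseline decoding step at 32k, which is consistent with the claim that KV cache loading dominates verification latency.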
Key Points
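- Dustin combines lookahead signals from the draft model with the target model's historical attention to identify critical tokens across multi-step verification windows.
- The framework reduces recomputation latency by restricting importance scoring to a minimal subset of attention heads.
- Evaluations on PG-19 and LongBench with Qwen2.5-72B show a 27.85x self-attention speedup and a 9.17x end-to-end decoding speedup at a 32k sequence length.
- Static eviction methods lose accuracy because of saliency shift, while dynamic selection adds prohibitive overhead on the verification path.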
- Dustin addresses the KV cache loading bottleneck in long-context LLMs.
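To give a sense of scale for the KV cache bottleneck, here is a rough size estimate. The model dimensions (80 layers, 8 KV heads, head dimension 128), the 2-byte element size, the reading of "32k" as 32,768 tokens, and the 2,048-token sparse budget are all assumptions made for illustration; none of them comes from the summary.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Assumed configuration (not taken from the summary).
ASSUMED_MODEL = dict(num_layers=80, num_kv_heads=8, head_dim=128)

full = kv_cache_bytes(32 * 1024, **ASSUMED_MODEL)    # dense 32k context
sparse = kv_cache_bytes(2048, **ASSUMED_MODEL)       # hypothetical token budget

print(f"dense:  {full / 2**30:.3f} GiB per sequence")
print(f"sparse: {sparse / 2**30:.3f} GiB per sequence")
```

With these assumed numbers, a dense 32k cache is about 10 GiB per sequence, and every verification step has to read it. Keeping only a small set of critical tokens shrinks the bytes loaded proportionally, which is where a sparse verifier would get its attention speedup.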
Article Excerpt
From source RSS / original summary: arXiv:2606.24957v1
Abstract: While speculative decoding improves inference throughput for multi-batch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck where Key-Value (KV) cache loading dominates latency. Existing compression methods fail in this regime: static eviction incurs accuracy loss due to saliency shift, while dynamic selection introduces prohibitive computational overhead during the verification path.
We propose Dustin, a sparse verification framework designed for long-context speculative decoding. Dustin integrates lookahead signals from the draft model with historical attention from the target model to identify critical tokens with high fidelity across multi-step verification windows. To reduce recomputation latency, this approach further employs a sparse estimation scheme that restricts importance scoring to a minimal subset of attention heads. Evaluations on PG-19 and LongBench with Qwen2.5-72B demonstrate that Dustin achieves a 27.85x speedup in self-attention and a 9.17x end-to-end decoding speedup at a 32k sequence length, all with negligible accuracy degradation.
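The abstract does not spell out how the two signals are combined, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes per-token importance is a weighted sum of (a) historical target-model attention averaged over a small head subset and (b) a draft-model lookahead score, followed by top-k selection. The function name, the `alpha` weight, the head subset, and the toy data are all hypothetical.

```python
import heapq
from typing import List, Sequence


def _normalize(xs: Sequence[float]) -> List[float]:
    """Scale scores to sum to 1 so the two signals are comparable."""
    total = sum(xs)
    return [x / total for x in xs] if total > 0 else [0.0] * len(xs)


def select_critical_tokens(hist_attn: Sequence[Sequence[float]],
                           draft_attn: Sequence[float],
                           head_subset: Sequence[int],
                           budget: int,
                           alpha: float = 0.5) -> List[int]:
    """Return sorted indices of the context tokens to keep for verification.

    hist_attn[h][t]: attention mass target head h has given token t so far.
    draft_attn[t]:   draft-model attention to token t for the upcoming window.
    head_subset:     the few target heads used for scoring (assumed given).
    alpha:           assumed mixing weight between history and lookahead.
    """
    ctx_len = len(draft_attn)
    # Score history using only the chosen heads, not all of them.
    hist = [sum(hist_attn[h][t] for h in head_subset) / len(head_subset)
            for t in range(ctx_len)]
    hist, draft = _normalize(hist), _normalize(draft_attn)
    score = [alpha * h + (1.0 - alpha) * d for h, d in zip(hist, draft)]
    keep = heapq.nlargest(min(budget, ctx_len), range(ctx_len),
                          key=score.__getitem__)
    return sorted(keep)


# Toy data: 3 heads, 6 context tokens. Head 2 is left out of the subset.
hist_attn = [
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.4, 0.2, 0.1, 0.1, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
draft_attn = [0.0, 0.0, 0.0, 0.9, 0.1, 0.0]

kept = select_critical_tokens(hist_attn, draft_attn, head_subset=[0, 1], budget=2)
print(kept)  # [0, 3]
```

In this toy case token 0 is kept because the target model attended to it heavily in the past, and token 3 is kept because the draft model signals it will matter for the upcoming window. Token 5, favored only by the excluded head, is dropped, which illustrates the accuracy risk a real system would have to manage when choosing the head subset.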