JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
Quick Answer
JetFlow is a head-based speculative decoding framework that accelerates autoregressive LLMs, reaching up to 9.64x speedup on MATH-500 and up to 4.58x on open-ended conversational workloads on H100 GPUs, ahead of bidirectional-head and tree-based baselines.
Quick Take
JetFlow pairs branch-wise causal conditioning with one-forward drafting, so larger draft budgets translate into longer accepted prefixes and higher end-to-end speedup on both dense and MoE Qwen3 models.
Key Points
- JetFlow combines one-forward drafting efficiency with branch-wise causal conditioning.
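- Achieves up to 9.64x speedup on MATH-500 and up to 4.58x on open-ended conversational workloads, measured on H100 GPUs.
- Outperforms bidirectional-head and tree-based SD baselines across math, coding, and chat benchmarks on dense and MoE Qwen3 models.
- Trains a causal parallel draft head over fused hidden states from the frozen target model, so candidate-tree scores align with the target model's autoregressive factorization.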
- Code and models available at https://github.com/hao-ai-lab/JetFlow.
Article Content
From source RSS / original summary
arXiv:2606.18394v1 (Announce Type: new)
Abstract: Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma.
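The draft-then-verify loop and the scaling limitation described above can be illustrated with a small sketch. It is not taken from the paper: `verify_greedy`, `toy_target_next`, and `estimated_speedup` are hypothetical names, the verification rule is plain greedy matching, and the speedup formula is a deliberately simplified cost model that measures everything in units of one target forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted by one greedy draft-then-verify step.

    A real system scores every draft position in a single target forward
    pass; this sketch calls target_next position by position for clarity.
    """
    context = list(prefix)
    emitted: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        context.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(context))
    return emitted


def estimated_speedup(mean_emitted: float, draft_cost: float) -> float:
    """Simplified model (an assumption): tokens per step over cost per step.

    draft_cost is the drafting time expressed as a fraction of one target
    forward pass; verification is assumed to cost exactly one target pass.
    """
    return mean_emitted / (draft_cost + 1.0)


def toy_target_next(context: List[int]) -> int:
    # Hypothetical target model: always predicts (last token + 1) mod 10.
    return (context[-1] + 1) % 10


partial = verify_greedy([0], [1, 2, 9], toy_target_next)  # mismatch at the third draft token
full = verify_greedy([0], [1, 2, 3], toy_target_next)     # all draft tokens accepted
```

Under this model, a larger draft only helps if `mean_emitted` grows faster than `draft_cost`, which is the ceiling the abstract refers to.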
Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance.
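A toy example may help show why branch-agnostic marginals waste budget. The distribution below is invented for illustration (it is not data from the paper), and the function names are hypothetical. It compares candidate paths built from independent per-position marginals with paths scored by the chain rule.

```python
from itertools import product
from typing import Dict, List, Tuple

Path = Tuple[str, ...]

# Hypothetical target distribution over two-token continuations.
JOINT: Dict[Path, float] = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint: Dict[Path, float]) -> List[Dict[str, float]]:
    """Per-position token marginals, ignoring which branch a token is on."""
    length = len(next(iter(joint)))
    out: List[Dict[str, float]] = [dict() for _ in range(length)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            out[i][tok] = out[i].get(tok, 0.0) + p
    return out


def branch_agnostic_paths(joint: Dict[Path, float]) -> Dict[Path, float]:
    """Score every combination of per-position tokens by the product of marginals."""
    scored: Dict[Path, float] = {}
    for combo in product(*(m.items() for m in marginals(joint))):
        score = 1.0
        for _, p in combo:
            score *= p
        scored[tuple(tok for tok, _ in combo)] = score
    return scored


def path_conditioned_paths(joint: Dict[Path, float]) -> Dict[Path, float]:
    """Score two-token paths as p(first) * p(second | first)."""
    first = marginals(joint)[0]
    scored: Dict[Path, float] = {}
    for seq, p in joint.items():
        conditional = p / first[seq[0]]
        scored[seq] = first[seq[0]] * conditional
    return scored


agnostic = branch_agnostic_paths(JOINT)
conditioned = path_conditioned_paths(JOINT)
# Paths the marginal-based tree proposes but the target would never produce.
inconsistent = sorted(path for path in agnostic if path not in JOINT)
```

Here each token is plausible on its own, yet half of the marginal-based candidates ("Los York", "New Angeles") can never be accepted, so the draft budget spent on them is lost.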
We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup.
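To make "scores align with the target model's autoregressive factorization" concrete, the sketch below builds a candidate tree under a fixed node budget, ranking each node by the product of conditional probabilities along its path. This is a generic best-first tree builder, not JetFlow's published algorithm; `build_tree`, `toy_cond`, and the probability table are hypothetical, and a lookup table stands in for whatever the draft head would actually predict.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]

# Hypothetical conditional distributions p(next token | path so far).
TABLE: Dict[Path, Dict[str, float]] = {
    (): {"New": 0.6, "Los": 0.4},
    ("New",): {"York": 0.9, "Jersey": 0.1},
    ("Los",): {"Angeles": 1.0},
}


def toy_cond(path: Path) -> Dict[str, float]:
    return TABLE.get(path, {})


def build_tree(
    cond: Callable[[Path], Dict[str, float]],
    budget: int,
    max_depth: int,
) -> List[Tuple[Path, float]]:
    """Pick up to `budget` tree nodes with the highest path probability.

    A node's score is the product of conditionals along its path, i.e. the
    chain-rule factorization. heapq is a min-heap, so we store negated
    log-probabilities.
    """
    heap = [(-math.log(p), (tok,)) for tok, p in cond(()).items() if p > 0]
    heapq.heapify(heap)
    chosen: List[Tuple[Path, float]] = []
    while heap and len(chosen) < budget:
        neg_logp, path = heapq.heappop(heap)
        chosen.append((path, math.exp(-neg_logp)))
        if len(path) < max_depth:
            for tok, p in cond(path).items():
                if p > 0:
                    heapq.heappush(heap, (neg_logp - math.log(p), path + (tok,)))
    return chosen


tree = build_tree(toy_cond, budget=4, max_depth=2)
tree_paths = [path for path, _ in tree]
```

Because a child's score can never exceed its parent's, the selected nodes always form a connected tree, and the low-probability branch ("New", "Jersey") is the first to be dropped when the budget is tight.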
Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.