LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
Quick Take
LazyAttention introduces a novel attention mechanism that enables zero-copy, position-agnostic key-value (KV) reuse, improving inference efficiency in retrieval-augmented generation. Under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37x and increases inference throughput by 1.40x compared to Block-Attention, while maintaining comparable output quality.
Key Points
- LazyAttention defers positional encoding to attention time, so cached KV entries can be reused at arbitrary positions.
- Achieves 1.37x reduction in time-to-first-token under skewed document distributions.
- Increases inference throughput by 1.40x compared to the state-of-the-art Block-Attention.
- Maintains comparable output quality while improving efficiency.
- Addresses limitations of conventional KV caching in long-context applications.
Article Content
From source RSS / original summary (arXiv:2606.04302v1)
Abstract: Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability.
Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions.
Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37x and increases inference throughput by 1.40x compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
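The idea of deferring positional encoding until attention time can be illustrated with a small NumPy sketch. This is not the paper's implementation. It assumes rotary position embeddings (RoPE), which the abstract does not name, and every function name here (`rope`, `attention`, `lazy_attention`) is hypothetical. The sketch stores keys without positional information and rotates them to the requested logical positions only when attention is computed, so one cached copy can serve requests that place the same document at different offsets. Unlike the fused kernels the abstract describes, this toy version does create a temporary rotated array.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to rows of x (n, d) at the given positions."""
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)      # one frequency per dimension pair
    angles = positions[:, None] * inv_freq[None, :]   # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def attention(q, k, v):
    """Plain softmax attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def lazy_attention(q, q_pos, k_raw, v, k_offset):
    """Attend over position-free cached keys, assigning positions at call time.

    k_raw is the shared cache (keys stored WITHOUT positional encoding).
    k_offset is the logical start position of the cached block in this request.
    """
    k_pos = k_offset + np.arange(k_raw.shape[0])
    # A real kernel would apply this rotation tile by tile inside the attention
    # loop; here NumPy builds a temporary array and leaves k_raw unmodified.
    return attention(rope(q, q_pos), rope(k_raw, k_pos), v)

# One physical cache for a document of 8 tokens, head dimension 16.
rng = np.random.default_rng(0)
k_raw = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 16))
q = rng.standard_normal((1, 16))

# Two logical requests place the same document at different offsets.
out_a = lazy_attention(q, np.array([8]), k_raw, v, k_offset=0)
out_b = lazy_attention(q, np.array([108]), k_raw, v, k_offset=100)
print(np.allclose(out_a, out_b))
```

Because RoPE attention scores depend only on the relative distance between query and key positions, the two requests above see the same relative layout and should produce matching outputs, even though the cached keys were never rewritten for either offset.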