CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
Quick Answer
CacheWeaver introduces a prompt-layer method that enhances retrieval-augmented generation (RAG) by optimizing evidence ordering, achieving a 20-33% reduction in median time-to-first-token (TTFT) across three vLLM configurations without compromising answer quality.
Quick Take
CacheWeaver is a prompt-layer method that reorders retrieved evidence in retrieval-augmented generation (RAG) prompts so that the serving engine's prefix cache is hit more often. It keeps a prefix tree over recently served evidence sequences and places the most reusable prefix first, which lowers median time-to-first-token (TTFT) by about 20-33% across three vLLM configurations without hurting answer quality in the authors' QA tests.
Key Points
- CacheWeaver reduces median TTFT by 20-33% in grounded generation tasks.
- Utilizes a prefix tree to optimize evidence ordering without altering retrieval sets.
- The greedy ordering policy recovers 97.5% of the median TTFT gain achieved by oracle ordering.
- Works as a simple scheduling layer between retrieval and inference, leaving the serving engine unchanged.
- Maintains answer quality in the authors' QA tests across three vLLM configurations.
Article Excerpt
From source RSS / original summary: arXiv:2606.19667v1. Abstract: Retrieval-augmented generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap.
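The gap between set overlap and prefix overlap can be illustrated with a small sketch. It works at the level of whole passages rather than tokens, and assumes that each passage renders to the same tokens wherever it appears and that the prompt preamble is identical across requests. The evidence IDs and the helper `shared_prefix_len` are hypothetical and are not taken from the paper.

```python
def shared_prefix_len(a, b):
    """Count the leading items that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical evidence IDs retrieved for two adjacent queries.
first = ["doc_a", "doc_b", "doc_c"]
second = ["doc_c", "doc_a", "doc_d"]

# Two passages are shared as a set...
set_overlap = len(set(first) & set(second))

# ...but in retrieval order none of them forms a common leading run,
# so a prefix cache has nothing to reuse from the evidence section.
prefix_overlap = shared_prefix_len(first, second)

# Same evidence set for the second query, reordered so a shared passage leads.
reordered = ["doc_a", "doc_c", "doc_d"]
reordered_overlap = shared_prefix_len(first, reordered)

print(set_overlap, prefix_overlap, reordered_overlap)
```

Reordering changes only the sequence, not the set, which is why the retrieved evidence can stay the same while the reusable prefix grows.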
We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests.
The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.
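The prefix tree and greedy walk described in the abstract might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the class names `TrieNode` and `EvidenceTrie`, the use of a per-node hit count as the measure of "most reusable", and the tie-breaking by retrieval order are all assumptions. The sketch also omits any eviction of old sequences, which a tree over "recently served" sequences would presumably need.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # evidence id -> TrieNode
        self.hits = 0       # number of served sequences that passed through here


class EvidenceTrie:
    """Prefix tree over served evidence sequences (no eviction in this sketch)."""

    def __init__(self):
        self.root = TrieNode()

    def record(self, sequence):
        """Insert a served evidence sequence, bumping hit counts along its path."""
        node = self.root
        for ev in sequence:
            node = node.children.setdefault(ev, TrieNode())
            node.hits += 1

    def order(self, retrieved):
        """Greedily walk the tree, always taking the most-hit child that is
        also in the retrieved set; leftovers keep their retrieval order."""
        remaining = list(retrieved)
        ordered = []
        node = self.root
        while remaining:
            candidates = [ev for ev in remaining if ev in node.children]
            if not candidates:
                break
            # Assumed scoring: hit count; ties resolve to earlier retrieval rank.
            best = max(candidates, key=lambda ev: node.children[ev].hits)
            ordered.append(best)
            remaining.remove(best)
            node = node.children[best]
        return ordered + remaining


trie = EvidenceTrie()
trie.record(["doc_a", "doc_b", "doc_c"])          # first request, served as retrieved
second = trie.order(["doc_c", "doc_a", "doc_d"])  # same set, cache-aware order
trie.record(second)
third = trie.order(["doc_d", "doc_c", "doc_a"])
print(second, third)
```

The output of `order` is always a permutation of its input, so the retrieved evidence set is unchanged and the serving engine needs no modification; only the order in which passages are placed in the prompt differs.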