CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

arXiv cs.CL · Kaizhen Tan, Rong Gu, Mingyuan Li

6/19/2026

Quick Answer

CacheWeaver is a prompt-layer method for retrieval-augmented generation (RAG) that orders retrieved evidence so that consecutive requests can reuse cached prompt prefixes. It reduces median time-to-first-token (TTFT) by 20-33% across three vLLM configurations without compromising answer quality.

Quick Take

The approach uses a prefix tree to choose a cache-aware evidence order, so that overlapping retrievals become reusable prefix overlap and prefill cost drops in grounded generation tasks.

Key Points

CacheWeaver reduces median TTFT by 20-33% in grounded generation tasks.
Utilizes a prefix tree to optimize evidence ordering without altering retrieval sets.
Achieves 97.5% of the TTFT gain of oracle ordering using only a simple scheduling layer.
Improves efficiency in retrieval-augmented generation (RAG) applications.
Maintains answer quality during performance enhancements across vLLM configurations.
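
The prefix-tree idea in the points above can be illustrated with a small sketch. This summary does not describe the paper's actual algorithm, so everything here is an assumption made for illustration: the class names `TrieNode` and `OrderingTrie`, the greedy descent rule, and the `doc_*` identifiers are all hypothetical. The sketch records each evidence order that was served, then orders a new retrieval set by walking the tree as far as the set allows. The retrieval set itself is never changed, only its order.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node per evidence ID along a previously served ordering."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}


class OrderingTrie:
    """Hypothetical prefix tree over evidence orderings (not the paper's code)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, ordering: Iterable[str]) -> None:
        # Record an ordering that was actually sent to the serving engine.
        node = self.root
        for doc_id in ordering:
            node = node.children.setdefault(doc_id, TrieNode())

    def order(self, retrieved: List[str]) -> List[str]:
        # Greedy descent: keep following tree edges whose document is in
        # the retrieved set, then append the rest in retrieval order.
        # Assumes document IDs are unique within one retrieval.
        remaining = list(retrieved)
        node = self.root
        ordered: List[str] = []
        while True:
            nxt = next((d for d in remaining if d in node.children), None)
            if nxt is None:
                break
            ordered.append(nxt)
            remaining.remove(nxt)
            node = node.children[nxt]
        return ordered + remaining


trie = OrderingTrie()

first = trie.order(["doc_a", "doc_b", "doc_c"])   # empty tree: order unchanged
trie.insert(first)

second = trie.order(["doc_c", "doc_a", "doc_d"])  # shares doc_a with `first`
trie.insert(second)

third = trie.order(["doc_d", "doc_c", "doc_a"])   # same set as `second`

print(first, second, third)
```

In this sketch the third request ends up with exactly the same order as the second even though the retriever returned the documents in a different order, which is the condition under which a prefix cache can be reused. A greedy walk is the simplest possible policy and is not guaranteed to find the longest reusable prefix.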

Source Excerpt

Retrieval-augmented generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method ke…
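
The gap between set overlap and prefix overlap described in the excerpt can be made concrete with a few lines of Python. This is a minimal sketch, not the method from the paper: the helpers `shared_prefix_len` and `align_to_previous` and the `doc_*` identifiers are hypothetical, and it assumes both prompts use the same template with identical document text, so a shared document-level prefix corresponds to a shared token prefix.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading positions where the two orderings agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def align_to_previous(previous: Sequence[str], retrieved: Sequence[str]) -> List[str]:
    """Reorder `retrieved` to reuse as much of `previous` as a prefix as possible.

    Only a contiguous head of `previous` can be reused, so the walk stops
    at the first document that was not retrieved this time.
    """
    retrieved_set = set(retrieved)
    head: List[str] = []
    for doc_id in previous:
        if doc_id not in retrieved_set:
            break
        head.append(doc_id)
    head_set = set(head)
    tail = [d for d in retrieved if d not in head_set]
    return head + tail


previous = ["doc_a", "doc_b", "doc_c"]    # evidence order of the earlier request
retrieved = ["doc_b", "doc_a", "doc_d"]   # retriever's order for the new request

set_overlap = len(set(previous) & set(retrieved))
naive_prefix = shared_prefix_len(previous, retrieved)

aligned = align_to_previous(previous, retrieved)
aligned_prefix = shared_prefix_len(previous, aligned)

print(set_overlap, naive_prefix, aligned, aligned_prefix)
```

Here two of the three documents overlap as a set, yet the retriever's order shares no prefix with the earlier request. Reordering the same documents lets both shared documents sit at the front, where a prefix cache can reuse them.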
