A Policy-Driven Runtime Layer for Agentic LLM Serving
Quick Take
A proposed agent runtime layer sits between the agent framework and the serving engine in multi-agent LLM serving. Its KV-caching instantiation, CacheSage, improves cache hit rates by 13-37 pp, reduces mean TTFT by 12-29%, and raises throughput by 6-14% over an unmodified serving stack, according to preliminary results on five real multi-agent workloads.
Key Points
- Introduces an agent runtime layer to bridge framework and engine in LLM systems.
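- CacheSage learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch of the KV cache.
- Preliminary results show a 13-37 pp increase in cache hit rates across five real multi-agent workloads.
- Reduces mean TTFT by 12-29% and increases throughput by 6-14% over an unmodified serving stack.
- Maps nine cross-cutting policies, including fairness and safety enforcement, onto the layer; only KV caching across sessions is validated in depth.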
Article Content
From source RSS / original summary
arXiv:2605.27744v1 Announce Type: new
Abstract: Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents.
A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other.
We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate.
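To make the four primitives concrete, here is a minimal Python sketch of what a policy plugging into such a layer might look like. The abstract names only the primitives (observe, score, predict, act) and agent identity as the shared coordinate; every class name, field, signature, and the toy request-budget policy below is a hypothetical illustration, not the paper's API.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class EngineEvent:
    """Hypothetical engine-level event, tagged with the agent that caused it."""
    agent_id: str
    session_id: str
    kind: str  # made-up event names, e.g. "request_start"


class AgentPolicy(Protocol):
    """Assumed shape of a policy: four primitives keyed by agent identity."""

    def observe(self, event: EngineEvent) -> None: ...
    def score(self, agent_id: str) -> float: ...
    def predict(self, agent_id: str) -> Sequence[str]: ...
    def act(self, agent_id: str) -> str: ...


class RequestBudgetPolicy:
    """Toy fairness-style policy: defer agents that exhausted a request budget."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.counts: dict[str, int] = {}

    def observe(self, event: EngineEvent) -> None:
        # Count only request starts; other event kinds are ignored here.
        if event.kind == "request_start":
            self.counts[event.agent_id] = self.counts.get(event.agent_id, 0) + 1

    def score(self, agent_id: str) -> float:
        # 1.0 means the full budget is left, 0.0 means it is used up.
        used = self.counts.get(agent_id, 0)
        return 1.0 - min(used / self.budget, 1.0)

    def predict(self, agent_id: str) -> Sequence[str]:
        # This toy policy makes no prediction about upcoming agents.
        return []

    def act(self, agent_id: str) -> str:
        return "admit" if self.score(agent_id) > 0.0 else "defer"
```

The point of the sketch is only that a policy needs both sides of the seam: the events come from the engine, while the key they are aggregated under (`agent_id`) comes from the framework.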
We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
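The abstract does not spell out how CacheSage works internally, so the following Python sketch only illustrates the general idea under stated assumptions: transition counts between agents are updated online per session, "survival" is read here as the estimated probability that an agent runs again within a short horizon, and prefetch targets the most likely next agent. All names (`TransitionModel`, `reuse_prob`, `pick_victim`, `prefetch_candidate`) are hypothetical.

```python
from collections import defaultdict


class TransitionModel:
    """Online estimate of the agent-to-agent transition matrix (as counts)."""

    def __init__(self) -> None:
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last_agent = {}  # session_id -> most recent agent in that session

    def observe(self, session_id: str, agent_id: str) -> None:
        # Each new step in a session adds one observed transition prev -> agent.
        prev = self.last_agent.get(session_id)
        if prev is not None:
            self.counts[prev][agent_id] += 1
        self.last_agent[session_id] = agent_id

    def next_probs(self, agent_id: str) -> dict:
        # Normalise one row of counts into next-agent probabilities.
        row = self.counts.get(agent_id, {})
        total = sum(row.values())
        return {a: c / total for a, c in row.items()} if total else {}

    def reuse_prob(self, current: str, target: str, horizon: int) -> float:
        """Probability that `target` runs within `horizon` steps after `current`."""
        if horizon <= 0:
            return 0.0
        p = 0.0
        for nxt, q in self.next_probs(current).items():
            if nxt == target:
                p += q
            else:
                p += q * self.reuse_prob(nxt, target, horizon - 1)
        return p


def pick_victim(model, current, cached_agents, horizon=2):
    # Assumed reading of survival-based eviction: drop the cached agent
    # least likely to be reused within the horizon.
    return min(cached_agents, key=lambda a: model.reuse_prob(current, a, horizon))


def prefetch_candidate(model, current):
    # Between steps, prefetch for the most likely next agent, if any is known.
    probs = model.next_probs(current)
    return max(probs, key=probs.get) if probs else None
```

For example, after observing the sessions planner → coder → reviewer → planner → coder and planner → tester, the model estimates that planner is followed by coder with probability 2/3 and by tester with probability 1/3, so it would prefetch for coder and prefer evicting tester. The recursion in `reuse_prob` is exponential in the horizon and is kept only for readability; a real system would presumably use matrix powers or a cached computation.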