CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation
Quick Answer
CONCORD introduces an asynchronous sparse aggregation framework for device-cloud retrieval-augmented generation (RAG) under document isolation, improving throughput by 1.66x and 2.15x on Natural Questions and WikiText-2 benchmarks, respectively.
Quick Take
CONCORD is an asynchronous sparse aggregation framework for device-cloud retrieval-augmented generation (RAG) under document isolation. It improves end-to-end throughput by 1.66x on Natural Questions and 2.15x on WikiText-2, while cutting per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.
Key Points
- CONCORD operates under a dual-end RAG setting with document isolation.
- It improves end-to-end throughput over baselines by 1.66x on Natural Questions and 2.15x on WikiText-2.
- The framework reduces per-token communication by over two orders of magnitude.
- Waiting debt control decides at each decoding step whether to keep waiting for remote participation, based on the observed return of waiting.
- It maintains answer quality and perplexity comparable to the baselines.
Article Content
From source RSS / original summary
arXiv:2606.15179v1 Announce Type: new
Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud.
Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation.
CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision.
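The abstract does not spell out how waiting debt is computed, so the following is only a minimal sketch of the general idea, under stated assumptions: waiting for the cloud accrues "debt" in milliseconds, a wait that actually changed the greedy token pays some of it back, and local steps slowly recover it. The class `WaitingDebtController`, its fields `budget_ms`, `payoff_ms`, and `recovery_ms`, and the helper `simulate` are all hypothetical names chosen for illustration, not the paper's method.

```python
from dataclasses import dataclass


@dataclass
class WaitingDebtController:
    """Illustrative controller (an assumption, not the paper's algorithm)."""

    budget_ms: int = 50    # assumed ceiling on outstanding waiting debt
    payoff_ms: int = 30    # assumed credit when waiting changed the token
    recovery_ms: int = 10  # assumed debt forgiven per locally committed step
    debt_ms: int = 0

    def should_wait(self) -> bool:
        # Keep waiting for remote evidence only while debt is under budget.
        return self.debt_ms < self.budget_ms

    def record_wait(self, waited_ms: int, changed_token: bool) -> None:
        # Waiting always costs time; a useful wait repays part of the cost.
        self.debt_ms += waited_ms
        if changed_token:
            self.debt_ms = max(0, self.debt_ms - self.payoff_ms)

    def record_local(self) -> None:
        # A step committed locally lets the debt recover a little.
        self.debt_ms = max(0, self.debt_ms - self.recovery_ms)


def simulate(events):
    """Replay (waited_ms, changed_token) events and log each decision."""
    ctl = WaitingDebtController()
    decisions = []
    for waited_ms, changed_token in events:
        if ctl.should_wait():
            ctl.record_wait(waited_ms, changed_token)
            decisions.append("wait")
        else:
            ctl.record_local()
            decisions.append("local")
    return decisions


if __name__ == "__main__":
    # Three unproductive waits push the debt over budget, so later steps go local.
    print(simulate([(20, False)] * 3 + [(20, True)] * 2))
```

In this toy version, the "observed return of waiting" is reduced to a single boolean per step; a real controller would presumably use a richer signal.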
Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by $1.66\times$ and $2.15\times$, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.
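To make the idea of requesting only the evidence needed for the greedy decision concrete, here is a hedged sketch. It assumes, purely for illustration, that dense aggregation adds a remote score to a local score per token and that each remote score lies in `[0, remote_max]`; the abstract does not state the actual aggregation rule or certificate. Under that assumption, any token whose local score plus `remote_max` falls below the best local score cannot win, so remote scores are fetched only for the remaining candidates. The functions `candidate_set`, `sparse_greedy`, and `dense_greedy` are hypothetical names.

```python
def candidate_set(local, remote_max):
    """Tokens that could still be the greedy choice after remote evidence."""
    best_local = max(local)
    return [t for t, s in enumerate(local) if s + remote_max >= best_local]


def sparse_greedy(local, remote_lookup, remote_max):
    """Return (greedy token, number of remote scores requested)."""
    cands = candidate_set(local, remote_max)
    if len(cands) == 1:
        # Certificate holds locally: no remote evidence can change the winner.
        return cands[0], 0
    scores = {t: local[t] + remote_lookup(t) for t in cands}
    return max(cands, key=lambda t: scores[t]), len(cands)


def dense_greedy(local, remote):
    """Reference: aggregate every token's remote score, then take the argmax."""
    total = [l + r for l, r in zip(local, remote)]
    return max(range(len(total)), key=lambda t: total[t])


if __name__ == "__main__":
    local = [2.0, 1.5, 0.25, 1.75]
    remote = [0.0, 0.25, 0.5, 0.5]  # assumed bounded by remote_max = 0.5
    token, requested = sparse_greedy(local, lambda t: remote[t], remote_max=0.5)
    print(token, requested, dense_greedy(local, remote))
```

In this toy setting the sparse path queries three of the four tokens and selects the same token as the dense reference, which mirrors the property the abstract claims for steps that consult the cloud.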