CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud. Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation. CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision. Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by $1.66\times$ and $2.15\times$, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.

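The abstract names the two mechanisms but does not specify them, so the following is only a minimal illustrative sketch under loudly stated assumptions, not the CONCORD algorithm itself. It assumes a hypothetical scoring model in which the dense dual-end score of each token is the sum of a local logit and a remote logit, and in which every remote logit is known in advance to lie in a fixed interval. Under that assumption, the device can fetch remote values one token at a time and stop as soon as the greedy choice is certified. The `WaitingDebt` controller is likewise a hypothetical stand-in for waiting debt control: it charges time spent waiting and credits steps where remote evidence changed the committed token. All names (`certified_greedy_token`, `fetch_remote`, `WaitingDebt`, `budget`, `payoff`) are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple


def certified_greedy_token(
    local: List[float],
    fetch_remote: Callable[[int], float],
    remote_lo: float,
    remote_hi: float,
) -> Tuple[int, int]:
    """Return (greedy token, number of remote values fetched).

    Assumed model (not from the paper): dense score[v] = local[v] + remote[v],
    with every remote[v] guaranteed to lie in [remote_lo, remote_hi].
    The result matches the dense argmax whenever that argmax is unique.
    """
    known: Dict[int, float] = {}  # token -> exact dense score, once fetched

    def upper(v: int) -> float:
        return known.get(v, local[v] + remote_hi)

    def lower(v: int) -> float:
        return known.get(v, local[v] + remote_lo)

    while True:
        # Leader: the token with the highest guaranteed score so far.
        leader = max(range(len(local)), key=lower)
        # Challenger: the most optimistic score among all other tokens.
        rivals = [v for v in range(len(local)) if v != leader]
        challenger = max(rivals, key=upper, default=leader)
        if challenger == leader or lower(leader) >= upper(challenger):
            return leader, len(known)  # certificate holds, stop fetching
        # Otherwise resolve whichever of the two is still uncertain.
        target = leader if leader not in known else challenger
        known[target] = local[target] + fetch_remote(target)


class WaitingDebt:
    """Hypothetical controller: stop waiting once unpaid waiting time piles up."""

    def __init__(self, budget: float, payoff: float) -> None:
        self.budget = budget  # maximum tolerated debt, in seconds
        self.payoff = payoff  # debt forgiven when waiting changed the token
        self.debt = 0.0

    def should_wait(self) -> bool:
        return self.debt < self.budget

    def record(self, waited_s: float, changed_token: bool) -> None:
        self.debt += waited_s
        if changed_token:
            self.debt = max(0.0, self.debt - self.payoff)


if __name__ == "__main__":
    remote = [0.1, 0.9, 0.3, 0.0]  # made-up remote logits
    token, fetched = certified_greedy_token(
        [2.0, 1.5, 0.1, -1.0], lambda v: remote[v], 0.0, 1.0
    )
    print(token, fetched)
```

In this toy run the certificate is reached after fetching two of the four remote values, which is the sense in which supplementation is "minimal" here. How tight the bounds are determines how much communication is saved, and the paper's actual certificate may be constructed quite differently.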