Retrieval-Augmented Generation (RAG) faces a trade-off: concatenating documents into a long prompt enables multi-document reasoning but creates a prefill bottleneck, while encoding each document's KV cache separately offers speed but breaks cross-document interaction. We propose Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding process. Pced treats retrieved documents as isolated "experts" and synchronizes their predictions via a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model's prior. This approach recovers cross-document reasoning without constructing shared attention across documents.
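To make the decoding-time aggregation concrete, here is a minimal sketch of one decoding step in the spirit described above: each document's "expert" logits are contrasted against the no-context prior logits, then combined into a single next-token distribution. The function name `pced_step`, the scalar `alpha`, and the uniform expert `weights` are illustrative assumptions, not the paper's exact retrieval-aware rule.

```python
import numpy as np

def pced_step(expert_logits, prior_logits, alpha=1.0, weights=None):
    """One decoding step: contrast each per-document expert's logits
    against the model's prior, then aggregate into one distribution.

    expert_logits: array of shape (n_experts, vocab_size)
    prior_logits:  array of shape (vocab_size,) from the model with no context
    alpha:         strength of the contrast term (illustrative assumption)
    weights:       per-expert weights; uniform if None (the actual rule
                   would be retrieval-aware, e.g. retrieval scores)
    """
    expert_logits = np.asarray(expert_logits, dtype=float)
    prior_logits = np.asarray(prior_logits, dtype=float)
    if weights is None:
        weights = np.full(len(expert_logits), 1.0 / len(expert_logits))
    # Contrast: reward tokens an expert supports beyond the prior.
    contrast = expert_logits - alpha * prior_logits  # broadcast over experts
    # Aggregate expert evidence with the per-expert weights.
    combined = weights @ contrast
    # Softmax over the vocabulary to obtain next-token probabilities.
    z = combined - combined.max()
    p = np.exp(z)
    return p / p.sum()
```

A token favored only by the prior (and by no expert) is suppressed by the contrast term, which is how evidence aggregation moves from attention into decoding.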