Context-augmented generation (CAG) techniques, including RAG and ICL, require efficiently combining multiple contexts to generate responses to user queries. Directly inputting these contexts as one sequence introduces a considerable computational burden, since the combined contexts must be re-encoded for every request. To address this, we explore the potential of parallel encoding to independently pre-compute and cache each context's KV states. This approach enables the direct loading of cached states during inference while accommodating more contexts through position reuse. However, directly applying parallel encoding results in a significant performance drop due to a misaligned attention distribution. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding ($\textbf{APE}$), which introduces a shared prefix, an attention temperature, and a scaling factor to align the distribution of parallel encoding with that of sequential encoding. Results on RAG and ICL tasks demonstrate that APE preserves 98% and 93% of sequential encoding performance on the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE achieves an end-to-end 4.5$\times$ speedup by reducing prefilling time by 28$\times$ for a 128K-length context.
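The core idea above can be sketched in a toy single-head attention example: each context's KV states are computed independently (positions restarting from zero, i.e., position reuse), cached, and concatenated at inference time, with an attention temperature and a scaling factor applied to the context logits before they are normalized jointly with the local (query-side) logits. This is an illustrative sketch only; the projections are identities, and the `temperature` and `scale` values are made-up placeholders, not the paper's calibrated settings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ape_attend(q, K_ctx, V_ctx, K_loc, V_loc, temperature=0.9, scale=0.95):
    """Attention of query q over concatenated cached context KVs plus
    local KVs, with APE-style temperature and scaling adjustments
    (hypothetical values chosen for illustration)."""
    d = q.shape[-1]
    # Temperature sharpens the attention distribution over parallel contexts
    ctx_logits = (K_ctx @ q) / np.sqrt(d) / temperature
    loc_logits = (K_loc @ q) / np.sqrt(d)
    # Scaling factor rebalances context vs. local attention mass
    logits = np.concatenate([ctx_logits + np.log(scale), loc_logits])
    w = softmax(logits)
    return w @ np.concatenate([V_ctx, V_loc])

rng = np.random.default_rng(0)
d = 8
# "Parallel encoding": each context's KV cache is built independently
# (here K = V = raw token embeddings for simplicity)
contexts = [rng.normal(size=(5, d)) for _ in range(3)]
caches = [(c, c) for c in contexts]

# Inference: load the pre-computed caches directly and concatenate them
K_ctx = np.concatenate([k for k, _ in caches])
V_ctx = np.concatenate([v for _, v in caches])

# Local KV states from the user query itself
query_tokens = rng.normal(size=(4, d))
q = rng.normal(size=(d,))
out = ape_attend(q, K_ctx, V_ctx, query_tokens, query_tokens)
print(out.shape)  # (8,)
```

Because each cache is encoded without seeing the others, adding or swapping a context only requires encoding that one context, rather than re-prefilling the full concatenation.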