A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and we study training the same compressor to support multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly, however, the trade-offs across architectures and training regimes are more nuanced, illustrating the complex landscape of compression methods.
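To make the mean-pooling idea concrete, the following is a minimal sketch (not the paper's implementation): the retrieved context's token representations are averaged over non-overlapping windows of size equal to the compression ratio, yielding a continuous sequence that is that many times shorter and can be fed to the LLM in place of the full context. The function name, the `ratio` parameter, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def compress_by_mean_pooling(hidden_states: torch.Tensor, ratio: int) -> torch.Tensor:
    """Mean-pool (batch, seq_len, dim) states into (batch, ceil(seq_len / ratio), dim)."""
    batch, seq_len, dim = hidden_states.shape
    # Pad the sequence so its length is divisible by the compression ratio.
    pad = (-seq_len) % ratio
    if pad:
        hidden_states = F.pad(hidden_states, (0, 0, 0, pad))
    # Average each non-overlapping window of `ratio` token representations.
    return hidden_states.reshape(batch, -1, ratio, dim).mean(dim=2)

# Example: compress 256 context-token states by 16x into 16 soft embeddings.
states = torch.randn(1, 256, 4096)
compressed = compress_by_mean_pooling(states, ratio=16)
print(compressed.shape)  # torch.Size([1, 16, 4096])
```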