We introduce EXIT, an extractive context compression framework that improves both the effectiveness and efficiency of retrieval-augmented generation (RAG) in question answering (QA). Current RAG systems often struggle when retrieval models fail to rank the most relevant documents, and compensate by including more context at the expense of latency and accuracy. Abstractive compression methods can drastically reduce token counts, but their token-by-token generation significantly increases end-to-end latency. Conversely, existing extractive methods reduce latency but rely on independent, non-adaptive sentence selection, failing to fully exploit contextual information. EXIT addresses these limitations by classifying sentences from retrieved documents while preserving their contextual dependencies, enabling parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks show that EXIT consistently surpasses existing compression methods, and even uncompressed baselines, in QA accuracy, while substantially reducing inference time and token count. By improving both effectiveness and efficiency, EXIT offers a promising direction for scalable, high-quality QA in RAG pipelines. Our code is available at https://github.com/ThisIsHwang/EXIT
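To make the extraction scheme described above concrete, here is a minimal sketch of context-aware, parallelizable sentence selection. The sentence splitter, the `relevance_score` function (a simple lexical-overlap stand-in for EXIT's trained classifier), and the `threshold` parameter are all illustrative assumptions, not the paper's actual implementation; the key ideas retained are that each sentence is scored independently (so scoring parallelizes), that the full document is passed as context to every scoring call, and that selected sentences keep their original order.

```python
from concurrent.futures import ThreadPoolExecutor

def split_sentences(document):
    # Naive period-based splitter; a real system would use a proper
    # sentence tokenizer.
    return [s.strip() + "." for s in document.split(".") if s.strip()]

def relevance_score(query, sentence, context):
    # Stand-in for a trained relevance classifier (assumption): lexical
    # overlap between query and sentence. The real model would also
    # condition on `context` to capture cross-sentence dependencies.
    q = set(query.lower().split())
    s = set(sentence.lower().rstrip(".").split())
    return len(q & s) / max(len(q), 1)

def compress(query, document, threshold=0.3):
    sentences = split_sentences(document)
    # Sentences are scored independently, hence parallelizable; the full
    # document is supplied as context to every call.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(
            lambda s: relevance_score(query, s, document), sentences))
    # Keep selected sentences in original order to preserve coherence.
    return " ".join(s for s, sc in zip(sentences, scores) if sc >= threshold)
```

Because sentences above the threshold are retained rather than regenerated, compression cost is a single parallel classification pass instead of token-by-token generation.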