We introduce EXIT, an extractive context compression framework that improves both the effectiveness and efficiency of retrieval-augmented generation (RAG) for question answering (QA). Current RAG systems often struggle when retrieval models fail to rank the most relevant documents, leading to the inclusion of more context at the expense of latency and accuracy. While abstractive compression methods can drastically reduce token counts, their token-by-token generation process significantly increases end-to-end latency. Conversely, existing extractive methods reduce latency but rely on independent, non-adaptive sentence selection, failing to fully exploit contextual information. EXIT addresses these limitations by classifying sentences from retrieved documents while preserving their contextual dependencies, enabling parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks show that EXIT consistently surpasses existing compression methods, and even uncompressed baselines, in QA accuracy, while also delivering substantial reductions in inference time and token count. By improving both effectiveness and efficiency, EXIT offers a promising direction for scalable, high-quality QA in RAG pipelines. Our code is available at https://github.com/ThisIsHwang/EXIT.
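To make the contrast with abstractive compression concrete, the sketch below illustrates the extractive, sentence-classification idea: each sentence is scored for relevance against the query (with the full document context available to the scorer) and scoring is embarrassingly parallel, so no tokens are generated autoregressively. The lexical-overlap scorer here is a hypothetical stand-in for EXIT's learned classifier, used only to show the interface; names like `score_sentence` and `compress` are illustrative, not from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def score_sentence(query: str, sentence: str, context: str) -> float:
    # Toy relevance scorer: fraction of query words appearing in the
    # sentence. EXIT instead uses a learned classifier conditioned on
    # the surrounding document context (passed here but unused by this
    # simplified stand-in).
    query_words = set(query.lower().split())
    sentence_words = set(sentence.lower().split())
    return len(query_words & sentence_words) / max(len(query_words), 1)

def compress(query: str, sentences: list[str], threshold: float = 0.2) -> str:
    # Sentences are scored independently, so the work parallelizes
    # trivially; unlike abstractive compression, latency does not grow
    # token by token with the output length.
    full_context = " ".join(sentences)
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(
            lambda s: score_sentence(query, s, full_context), sentences))
    # Keep only sentences classified as relevant, preserving order.
    kept = [s for s, sc in zip(sentences, scores) if sc >= threshold]
    return " ".join(kept)
```

A lower `threshold` retains more context (safer for multi-hop questions), while a higher one compresses more aggressively; EXIT adapts this trade-off to query complexity and retrieval quality rather than fixing it by hand.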