Retrieval-augmented generation (RAG) often suffers from long and noisy retrieved contexts. Existing context compression methods typically rely on heuristic relevance estimation or supervised compression models rather than on how LLMs utilize retrieved context during inference. We propose Sentinel, a lightweight sentence-level compression framework that decodes inference-time contextual utilization behaviors from head-wise attention patterns of frozen LLMs. To ground supervision in retrieval-dependent answering behavior, Sentinel trains a lightweight probe using QA examples where the model succeeds only when retrieved context is available. Sentinel performs compression using only a single non-autoregressive forward pass without dedicated compression training or autoregressive scoring. Empirically, we find that effective contextual utilization signals remain accessible even in compact proxy models. On LongBench, Sentinel with a 0.5B proxy model achieves up to 5$\times$ compression while attaining question-answering performance competitive with compression methods built on 7B-scale models. Despite being trained only on English QA data, Sentinel also generalizes effectively to Chinese and out-of-domain settings.
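The abstract does not specify how head-wise attention is pooled into features or how sentences are selected under a budget. As a minimal sketch of the general idea only, the example below assumes (hypothetically) that attention from the final prompt token is mean-pooled per head over each sentence, that the probe is a logistic-regression layer with given weights `w` and bias `b`, and that sentences are kept greedily by probe score until a token budget is reached. The function names, the pooling choice, and the greedy selection rule are illustrative assumptions, not Sentinel's actual implementation.

```python
import numpy as np


def sentence_features(attn, sentence_spans):
    """Pool attention into one feature vector per sentence.

    attn: array of shape (num_heads, context_len); row h holds the attention
          that head h assigns from the final token to each context token
          (assumed to be extracted from one forward pass of a frozen model).
    sentence_spans: list of (start, end) token index pairs, end exclusive.
    Returns an array of shape (num_sentences, num_heads).
    """
    return np.stack([attn[:, s:e].mean(axis=1) for s, e in sentence_spans])


def probe_scores(feats, w, b):
    """Logistic probe (assumed form): sigmoid(feats @ w + b) per sentence."""
    return 1.0 / (1.0 + np.exp(-(feats @ w + b)))


def compress(sentences, lengths, scores, budget):
    """Greedily keep the highest-scoring sentences that fit in the budget.

    lengths[i] is the token count of sentences[i]; the kept sentences are
    returned in their original document order.
    """
    kept, used = [], 0
    for i in np.argsort(-np.asarray(scores), kind="stable"):
        if used + lengths[i] <= budget:
            kept.append(int(i))
            used += lengths[i]
    return [sentences[i] for i in sorted(kept)]


# Toy inputs: 2 heads, 6 context tokens, 3 sentences of 2 tokens each.
attn = np.array([[0.1, 0.1, 0.6, 0.2, 0.0, 0.0],
                 [0.0, 0.2, 0.5, 0.3, 0.0, 0.0]])
spans = [(0, 2), (2, 4), (4, 6)]
sentences = ["A.", "B.", "C."]
lengths = [e - s for s, e in spans]

feats = sentence_features(attn, spans)
scores = probe_scores(feats, w=np.array([1.0, 1.0]), b=-0.4)
compressed = compress(sentences, lengths, scores, budget=4)
```

In this toy setup the second sentence receives the most attention from both heads, so it is selected first; the remaining budget then admits the next-best sentence. In practice the attention matrix would come from a real model's forward pass and the probe weights from training on retrieval-dependent QA examples, as the abstract describes.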