Large language models with long context windows can answer complex questions directly from full-length academic, technical, and policy documents, but passing entire documents is often costly and slow, can degrade answer quality, and increases the risk of unnecessary data leakage. This paper targets the common setting of answering many heterogeneous questions over one or more long documents, where fixed-position heuristics and standard retrieval-augmented generation (RAG) can fail: document structure varies, and semantic similarity between queries and chunks is often weak, so embedding retrievers typically need task- and domain-specific tuning. We propose {Selective Attention-Guided Extraction} (\ourmethod), a training-free, plug-and-play context reduction framework that uses a lightweight local LLM to perform a single prefilling pass and convert its attention signals into a query-specific relevance heatmap at configurable granularity. \ourmethod\ further introduces \emph{differential attention} strategies to better isolate question-relevant evidence, then selects the top-scoring units under a user-defined token budget and forwards only this reduced context to a downstream LLM for answer generation. \ourmethod\ outperforms traditional reduction techniques across multiple long-document QA benchmarks, notably securing a top-4 rank on QuALITY-hard while constrained to a 10\% context budget. It thus achieves a 90\% reduction in tokens with competitive accuracy, without model fine-tuning or complex calibration.
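The pipeline described in the abstract (attention heatmap, differential scoring, budgeted selection) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define the differential attention strategies, so the sketch assumes one plausible form: subtracting the scores obtained from a generic, question-agnostic prompt. The function names `unit_scores`, `differential_scores`, and `select_units` are hypothetical, and the attention matrices are synthetic stand-ins for what a local LLM's prefilling pass would produce.

```python
import numpy as np

def unit_scores(attn, spans):
    """Mean attention received by each unit (e.g. a sentence or paragraph).

    attn  : (num_query_tokens, num_context_tokens) weights, assumed to be
            already averaged over layers and heads.
    spans : list of (start, end) token offsets per unit, end exclusive.
    """
    per_token = attn.mean(axis=0)  # one weight per context token
    return np.array([per_token[s:e].mean() for s, e in spans])

def differential_scores(query_attn, baseline_attn, spans):
    # ASSUMPTION: "differential attention" is modelled here as the query
    # score minus the score under a question-agnostic baseline prompt.
    return unit_scores(query_attn, spans) - unit_scores(baseline_attn, spans)

def select_units(scores, spans, budget_ratio=0.10):
    """Greedily keep the highest-scoring units that fit the token budget.

    Returns the kept unit indices in document order, the tokens used,
    and the budget in tokens.
    """
    lengths = [e - s for s, e in spans]
    budget = int(budget_ratio * sum(lengths))
    chosen, used = [], 0
    for i in np.argsort(-scores, kind="stable"):  # best unit first
        if used + lengths[i] <= budget:
            chosen.append(int(i))
            used += lengths[i]
    return sorted(chosen), used, budget

# Synthetic example: 10 units of 10 tokens each, 4 query tokens.
spans = [(i * 10, (i + 1) * 10) for i in range(10)]

# The baseline prompt attends mostly to the opening unit.
baseline = np.full((4, 100), 0.005)
baseline[:, 0:10] = 0.055

# The real question also attends to the opening unit, plus unit 6.
query = np.full((4, 100), 0.004)
query[:, 0:10] = 0.05
query[:, 60:70] = 0.018

raw_pick = select_units(unit_scores(query, spans), spans)
diff_pick = select_units(differential_scores(query, baseline, spans), spans)
print("raw attention keeps units:", raw_pick[0])
print("differential attention keeps units:", diff_pick[0])
```

In this toy setup the opening unit draws heavy attention regardless of the question, so raw scores spend the whole 10\% budget on it, while the baseline-subtracted scores favor the unit that only the question attends to. Returning the kept units in document order is a design choice of the sketch, intended to preserve reading order for the downstream LLM.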