Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long-context reasoning. DySCO leverages retrieval heads, a subset of attention heads specialized for long-context retrieval, to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LM. Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval-head-guided selection for the method's effectiveness, while providing interpretability insights into decoding-time attention behavior. Our code is available at https://github.com/princeton-pli/DySCO.
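The core idea, up-weighting the attention given to tokens that retrieval heads point at, can be sketched as follows. This is a minimal illustration of the general mechanism, not the paper's exact algorithm: the function name `rescale_attention` and the parameters `top_k` and `boost` are hypothetical, and the token-selection rule (averaging retrieval-head attention and boosting the top-k tokens in logit space) is a simplification.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rescale_attention(logits, retrieval_logits, top_k=2, boost=2.0):
    """Up-weight the pre-softmax attention logits of the context tokens
    that the retrieval heads attend to most (illustrative sketch only).

    logits:           (n_tokens,) attention logits of the current head.
    retrieval_logits: (n_retrieval_heads, n_tokens) logits of the
                      retrieval heads for the same query position.
    """
    # Average retrieval-head attention over heads -> one score per token.
    scores = softmax(retrieval_logits).mean(axis=0)
    # Indices of the top_k most-retrieved context tokens.
    top = np.argsort(scores)[-top_k:]
    # Additive boost in logit space == multiplicative boost pre-normalization.
    boosted = logits.copy()
    boosted[..., top] += np.log(boost)
    return softmax(boosted)

# Toy example: one query over 5 context tokens, 2 retrieval heads that
# both concentrate on token 2.
logits = np.zeros(5)
retrieval_logits = np.array([[0.0, 0.0, 5.0, 0.0, 0.0],
                             [0.0, 0.0, 4.0, 0.0, 1.0]])
attn = rescale_attention(logits, retrieval_logits, top_k=1, boost=3.0)
```

With uniform base logits and `boost=3.0`, token 2 receives attention weight 3/7 while the remaining tokens keep 1/7 each, showing how the rescaled distribution shifts toward the retrieved token while staying normalized.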