Large language models face challenges in long-context question answering, where the key evidence for a query may be dispersed across millions of tokens. Existing work equips large language models with a memory buffer that is dynamically updated via a linear document scan, an approach known as "memorize while reading." While this approach scales efficiently, it suffers from premature pruning of latent evidence, information loss through overwriting, and sparse reinforcement-learning signals. To tackle these challenges, we present ReMemR1, which integrates a memory-retrieval mechanism into the memory-update process, enabling the agent to selectively recall historical memories for non-linear reasoning. To further strengthen training, we propose a multi-level reward design that combines final-answer rewards with dense, step-level signals guiding effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi-hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead, validating its ability to trade marginal cost for robust long-context reasoning.