Large language models face challenges in long-context question answering, where key evidence for a query may be dispersed across millions of tokens. Existing work equips large language models with a memory buffer that is dynamically updated during a linear document scan, the so-called "memorize while reading" approach. While this approach scales efficiently, it suffers from premature pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, which integrates a memory-retrieval mechanism into the memory update process, enabling the agent to selectively call back historical memories for non-linear reasoning. To further strengthen training, we propose a multi-level reward design that combines final-answer rewards with dense, step-level signals guiding effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi-hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead, validating its ability to trade marginal cost for robust long-context reasoning.
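To make the contrast concrete, the "memorize while reading" loop with a retrieval callback can be sketched as below. This is a minimal illustration, not the authors' actual method: `retrieve`, `memorize_while_reading`, the `summarize` callable, and the buffer/archive split are all hypothetical names and design choices assumed for exposition. The bounded `deque` models the overwriting memory buffer (the source of information loss), while the growing `archive` models the retrievable history that ReMemR1-style callback draws on.

```python
from collections import deque

def retrieve(archive, query, k=2):
    """Toy keyword-overlap retriever standing in for a real memory callback."""
    q_words = set(query.split())
    scored = sorted(archive, key=lambda note: -len(set(note.split()) & q_words))
    return scored[:k]

def memorize_while_reading(question, chunks, summarize, memory_size=3):
    """Hypothetical sketch: linear scan with a bounded memory buffer,
    plus a callback into the full note archive before each update."""
    memory = deque(maxlen=memory_size)  # overwriting buffer: old notes are evicted
    archive = []                        # retrievable history mitigates that loss
    for chunk in chunks:
        recalled = retrieve(archive, question)          # non-linear callback
        note = summarize(question, chunk, list(memory), recalled)
        memory.append(note)
        archive.append(note)
    return list(memory), archive
```

In a real system, `summarize` would be an LLM call that rewrites the notes given the new chunk; here any callable works, e.g. `lambda q, c, m, r: c` simply keeps the latest chunk, which is enough to see the buffer evict early evidence while the archive retains it.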