Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations {ADD, UPDATE, DELETE, NOOP}, and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and use with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the most competitive existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behaviors in LLMs, pointing toward richer, more persistent reasoning systems.
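To make the architecture concrete, the sketch below illustrates the structured memory operations (ADD, UPDATE, DELETE, NOOP) and a toy retrieval step. It is a minimal, hypothetical illustration only; the class and method names (MemoryOp, MemoryBank, apply_op, retrieve) are assumptions for exposition and do not reflect the paper's actual interfaces, which couple these operations to RL-trained Memory Manager and Answer Agent policies.

```python
# Minimal illustrative sketch of the memory operations described above.
# All names here are hypothetical, not the authors' implementation.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class MemoryOp(Enum):
    ADD = "ADD"        # insert a new memory entry
    UPDATE = "UPDATE"  # revise an existing entry with new information
    DELETE = "DELETE"  # remove an entry that is stale or contradicted
    NOOP = "NOOP"      # leave the memory bank unchanged


@dataclass
class MemoryBank:
    entries: Dict[str, str] = field(default_factory=dict)

    def apply_op(self, op: MemoryOp, key: str, content: Optional[str] = None) -> None:
        """Apply one operation chosen by a Memory Manager-style policy."""
        if op in (MemoryOp.ADD, MemoryOp.UPDATE):
            assert content is not None, "ADD/UPDATE require content"
            self.entries[key] = content
        elif op is MemoryOp.DELETE:
            self.entries.pop(key, None)
        # NOOP: intentionally no change

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        """Toy word-overlap retrieval; a real Answer Agent would instead
        select and reason over entries with a learned policy."""
        q = set(query.lower().split())
        scored = sorted(
            self.entries.values(),
            key=lambda e: len(q & set(e.lower().split())),
            reverse=True,
        )
        return scored[:k]


if __name__ == "__main__":
    bank = MemoryBank()
    bank.apply_op(MemoryOp.ADD, "pet", "User adopted a dog named Rex.")
    bank.apply_op(MemoryOp.UPDATE, "pet", "User's dog Rex moved with them to Berlin.")
    print(bank.retrieve("Where does the user's dog live?"))
```

In Memory-R1, the choice among these operations and the selection of entries to answer from are not hand-coded as above but learned with outcome-driven RL (PPO and GRPO), rewarding the agents based on final answer quality.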