Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
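The contrast between global and local group-relative comparison can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say how LoGo-GRPO combines the two signals, so the additive `combine` rule, the `local_weight` parameter, the reward values, and all function names below are assumptions made for illustration. Only the 8, 16, 32 session schedule comes from the abstract; how stages are advanced is also assumed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combine(global_adv, local_adv, local_weight=0.5):
    """Hypothetical mixing rule: add a weighted local term to the global term."""
    return global_adv + local_weight * local_adv


def horizon_for_stage(stage):
    """Progressive curriculum over 8, 16, 32 sessions (stage indexing is assumed)."""
    schedule = (8, 16, 32)
    return schedule[min(stage, len(schedule) - 1)]


# Global group: full multi-session trajectories. Their memory states have
# diverged, so these rewards are compared across different effective environments.
global_rewards = [0.2, 0.8, 0.5, 0.5]
global_adv = group_relative_advantages(global_rewards)

# Local group: several rerollouts of one memory operation, all restarted from
# the SAME intermediate memory state, so the comparison is like-for-like.
local_rewards = [1.0, 0.0, 0.0, 1.0]
local_adv = group_relative_advantages(local_rewards)

# Per-rollout training signal for the memory operation under the assumed rule.
mixed = [combine(g, l) for g, l in zip(global_adv, local_adv)]
```

In this sketch the local group isolates the effect of a single memory operation, because every rerollout starts from an identical memory snapshot, while the global term still carries the end-to-end trajectory reward.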
