Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows make it difficult to track preferences that evolve over long interactions. Existing memory systems rely mainly on static, hand-crafted update rules. Reinforcement learning (RL)-based agents can learn memory updates, but sparse outcome rewards provide weak supervision and lead to unstable long-horizon optimization. Drawing on memory schema theory and the functional division between prefrontal and hippocampal regions, we introduce MemCoE, a cognition-inspired two-stage optimization framework that learns both how memory should be organized and what information to update. In the first stage, Memory Guideline Induction optimizes a global guideline using contrastive feedback interpreted as textual gradients. In the second stage, Guideline-Aligned Memory Policy Optimization uses the induced guideline to define structured process rewards and performs multi-turn RL to learn a guideline-following memory evolution policy. We evaluate on three personalization memory benchmarks covering explicit and implicit preferences and different sizes and noise levels, and observe consistent improvements over strong baselines, along with favorable robustness, transferability, and efficiency.
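To illustrate why per-turn process rewards can give denser supervision than a sparse outcome reward alone, the following is a minimal sketch and not the MemCoE implementation. It assumes a hypothetical per-turn guideline-adherence score in [0, 1], a single outcome reward delivered at the final turn, and a simple weighted sum; the names `Turn`, `shaped_returns`, `process_weight`, and `gamma` are illustrative assumptions that do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    # Hypothetical per-turn score for how well a memory update follows
    # the induced guideline (assumed to lie in [0, 1]).
    process_reward: float


def shaped_returns(
    turns: List[Turn],
    outcome_reward: float,
    process_weight: float = 0.5,
    gamma: float = 1.0,
) -> List[float]:
    """Return the discounted return for each turn.

    The outcome reward is added only at the final turn (sparse signal);
    the weighted process reward is added at every turn (dense signal).
    """
    returns = [0.0] * len(turns)
    running = 0.0
    last = len(turns) - 1
    # Accumulate from the final turn backwards.
    for i in range(last, -1, -1):
        reward = process_weight * turns[i].process_reward
        if i == last:
            reward += outcome_reward
        running = reward + gamma * running
        returns[i] = running
    return returns


episode = [Turn(1.0), Turn(0.0), Turn(1.0)]
sparse_only = shaped_returns(episode, outcome_reward=1.0, process_weight=0.0)
with_process = shaped_returns(episode, outcome_reward=1.0, process_weight=0.5)
```

In this toy episode, the sparse-only setting assigns every turn the same return, so the learning signal cannot distinguish the guideline-violating second turn from the others. With the process term enabled, the returns differ across turns. How MemCoE actually structures and weights its process rewards is not specified in the abstract.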