Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their context windows, making memory retention (what to keep, discard, or later recover under a fixed budget) central to sustained performance. Most systems score memories with local rules such as recency or relevance, ignoring the delayed costs of retention decisions: future retrieval failures, recomputation, and stale-information use. We formulate retention as a constrained, partially observable stochastic optimization problem in which current decisions shape information demands that are revealed only later, and we prove that its single-step version is NP-hard. Since exact optimization is intractable and future demands are unknown, we develop \textbf{OSL-MR} (Observability-Safe Learning for Memory Retention), a learning-augmented approximation for deployable memory control. Its core principle is observability separation: deployed decisions use only online-observable signals, while supervision from evidence realized after an interaction is used solely for offline learning. OSL-MR pairs a budget-aware Mixed-Score heuristic, which serves as both a cold-start policy and an inductive prior, with an evidence learner that predicts which memories will later serve as evidence. Because the cumulative objective is non-decomposable and combinatorial, the learner is trained on evidence-membership signals rather than on reward, which yields a tractable, deployable target. On LoCoMo and LongMemEval, OSL-MR consistently outperforms strong heuristic and imitation-learning baselines, especially under tight budgets, and is robust across cost settings. On exactly solvable instances, retention is genuinely multi-step: a perfect single-step optimizer is far from optimal, whereas OSL-MR stays near the dynamic-programming optimum. These results establish constrained stochastic optimization and optimization-guided learning as a scalable foundation for memory in long-horizon agents.
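To make the idea of budget-aware, score-based retention concrete, the following is a minimal sketch and not the OSL-MR implementation. The abstract does not specify the form of the Mixed-Score heuristic, so everything here is an assumption made for illustration: the `Memory` fields, the weight `alpha`, the linear blend of recency and relevance, and the greedy score-per-cost selection rule are all hypothetical. The sketch uses only signals that would be observable online, in the spirit of observability separation, and it uses a greedy rule because budgeted selection of this kind is knapsack-like and exact optimization is expensive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    """A hypothetical memory item; all fields are illustrative assumptions."""
    key: str
    cost: int          # budget units the item occupies (e.g. tokens), must be > 0
    recency: float     # online-observable signal in [0, 1]
    relevance: float   # online-observable signal in [0, 1]


def mixed_score(m: Memory, alpha: float = 0.5) -> float:
    """Assumed linear blend of two local signals (not the paper's formula)."""
    return alpha * m.recency + (1.0 - alpha) * m.relevance


def retain(memories: list[Memory], budget: int, alpha: float = 0.5) -> list[Memory]:
    """Greedily keep items by score per unit cost until the budget is exhausted."""
    ranked = sorted(
        memories,
        key=lambda m: mixed_score(m, alpha) / m.cost,
        reverse=True,
    )
    kept: list[Memory] = []
    used = 0
    for m in ranked:
        # Skip any item that would push total cost past the budget.
        if used + m.cost <= budget:
            kept.append(m)
            used += m.cost
    return kept


if __name__ == "__main__":
    pool = [
        Memory("a", cost=4, recency=0.9, relevance=0.9),
        Memory("b", cost=2, recency=0.2, relevance=0.8),
        Memory("c", cost=5, recency=0.1, relevance=0.1),
    ]
    print([m.key for m in retain(pool, budget=6)])
```

A rule like this scores each memory locally and in isolation, which is exactly the class of policy the abstract argues is insufficient on its own: it cannot account for delayed costs such as later retrieval failures. In the described system, a learned evidence predictor complements such a heuristic; that component is not sketched here.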