Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but largely treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs including miss penalties, reacquisition delays, and stale-information risk. We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LOCOMO and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, particularly under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis demonstrates robustness across a wide range of cost configurations.
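As a rough illustration of budget-feasible retention with an online-safe heuristic score, the sketch below keeps the memory items with the highest score per token until a token budget is exhausted. It is a minimal sketch under stated assumptions, not the OSL-MR method: the `MemoryItem` fields, the two-term form of `mixed_score` (recency decay plus query relevance), its weights, and the greedy selection rule are all hypothetical, chosen only because they use features that would be observable at deployment time.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    # All fields are hypothetical, online-observable features.
    item_id: str
    tokens: int        # budget cost of keeping the item (must be > 0)
    age_steps: int     # steps since the item was written
    relevance: float   # similarity to the current query, in [0, 1]


def mixed_score(item, w_recency=0.5, w_relevance=0.5, half_life=10.0):
    """Assumed heuristic: weighted sum of recency decay and query relevance."""
    recency = 0.5 ** (item.age_steps / half_life)
    return w_recency * recency + w_relevance * item.relevance


def retain(items, budget_tokens):
    """Greedily keep items by score per token without exceeding the budget."""
    ranked = sorted(items, key=lambda it: mixed_score(it) / it.tokens, reverse=True)
    kept, used = [], 0
    for it in ranked:
        if used + it.tokens <= budget_tokens:
            kept.append(it)
            used += it.tokens
    return kept


items = [
    MemoryItem("a", tokens=100, age_steps=0, relevance=0.9),
    MemoryItem("b", tokens=100, age_steps=100, relevance=0.1),
    MemoryItem("c", tokens=50, age_steps=10, relevance=0.5),
]
kept = retain(items, budget_tokens=150)
print([it.item_id for it in kept])
```

A greedy rule like this makes only a local decision at each step. It ignores the delayed costs named above (miss penalties, reacquisition delays, stale-information risk), which is the gap the constrained formulation is meant to address.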
