Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering. It measures not only whether deployed agents degrade, but also what form the degradation takes and where repair should be targeted. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, roughly 400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on which stage the diagnostic profile implicates. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
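To make the idea of a stage-level diagnostic profile concrete, the following is a minimal sketch and not AgingBench's actual implementation. It assumes that each probe records three hypothetical boolean observations for a fact (`in_store`, `retrieved`, `answered_correctly`) and that a failure is attributed to the earliest stage that breaks; the names `ProbeResult`, `diagnose`, and `profile` are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass

# Stage labels follow the write / retrieval / utilization split named in the
# abstract; "ok" is an added label for probes that pass.
STAGES = ("ok", "write", "retrieval", "utilization")


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical record of one probe about a single fact (assumed schema)."""
    fact_id: str
    in_store: bool            # the fact is still present in the memory store
    retrieved: bool           # retrieval returned the fact for the probe query
    answered_correctly: bool  # the agent's final answer was correct


def diagnose(p: ProbeResult) -> str:
    """Attribute a failure to the earliest stage that broke (assumed rule)."""
    if p.answered_correctly:
        return "ok"
    if not p.in_store:
        return "write"
    if not p.retrieved:
        return "retrieval"
    # The fact was stored and retrieved, yet the answer is still wrong.
    return "utilization"


def profile(results) -> dict:
    """Return the share of probes assigned to each stage label."""
    counts = Counter(diagnose(p) for p in results)
    total = len(results)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {stage: counts[stage] / total for stage in STAGES}
```

Under these assumptions, two agents that give the same wrong answer can yield different profiles: one dominated by `write` failures suggests fixing what gets stored, while one dominated by `utilization` failures suggests fixing how retrieved facts are used.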