Where an LLM sits in an agent memory pipeline, between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, and purge (largely untested), shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail on canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); an inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and improves nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted, oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring, through which heterogeneous memory stores can enter in 130 lines of code. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
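To make the scoring rule concrete, here is a minimal sketch of what deterministic substring-match scoring with N/A handling could look like. It is an illustration under stated assumptions, not ForgetEval's actual implementation: the `CaseResult` shape, the field names, the case-sensitive matching, and the choice to exclude N/A cases from the denominator are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one forgetting case (hypothetical shape, not ForgetEval's schema)."""
    category: str
    # Text recalled after the mutation; None is assumed to mean the adapter
    # does not support the operation under test.
    recalled_text: Optional[str]
    must_be_absent: Sequence[str] = ()   # substrings that should have been forgotten
    must_be_present: Sequence[str] = ()  # substrings that should survive the mutation


def score_case(result: CaseResult) -> Optional[bool]:
    """Return True (pass), False (fail), or None (N/A)."""
    if result.recalled_text is None:
        return None  # unsupported operation: reported as N/A, not as a failure
    text = result.recalled_text
    # Assumption: plain, case-sensitive substring checks with no normalization.
    forgotten = all(s not in text for s in result.must_be_absent)
    retained = all(s in text for s in result.must_be_present)
    return forgotten and retained


def pass_rate(results: Sequence[CaseResult]) -> Tuple[Optional[float], int]:
    """Return (pass rate over scored cases, number of N/A cases)."""
    verdicts = [score_case(r) for r in results]
    scored = [v for v in verdicts if v is not None]
    n_na = len(verdicts) - len(scored)
    rate = sum(scored) / len(scored) if scored else None
    return rate, n_na


# Hypothetical cases for illustration only.
results = [
    CaseResult("compound-fact", "user likes tea",
               must_be_absent=["coffee"], must_be_present=["tea"]),
    CaseResult("compound-fact", "user likes coffee and tea",
               must_be_absent=["coffee"]),
    CaseResult("cross-lingual", None, must_be_absent=["coffee"]),
]
rate, n_na = pass_rate(results)
```

In this sketch the N/A count is returned alongside the pass rate, so a store that skips an operation is neither penalized for it nor able to hide the gap.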