Where an LLM sits in an agent memory pipeline, between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, and purge (largely untested), shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage. Deterministic primitives suffice for lexical and temporal categories but fail at canonicalization (5% on identifier-obfuscation, 0% on cross-lingual). An inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact). A mutation-time hook recovers intent-aware deletion (78-85%) and improves nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3 s/case mutation latency vs. 64-191 ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted, oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores plug in with 130 lines of code. Admission is corroborated by 10-annotator inter-annotator agreement (IAA; Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
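To make the scoring idea concrete, the following is a minimal sketch of deterministic substring scoring with N/A handling. It is not the ForgetEval implementation: the names `Case`, `ToyStore`, `NotSupported`, `score_case`, and `pass_rate`, the case fields, and the choice to exclude N/A results from the denominator are all assumptions made for illustration. The toy store exposes only four operations, not the six-method Adapter Protocol itself.

```python
from dataclasses import dataclass, field
from typing import Optional


class NotSupported(Exception):
    """Hypothetical signal: the store lacks this operation (scored N/A)."""


@dataclass
class Case:
    # Hypothetical case layout; the real ForgetEval schema may differ.
    setup: list[str]                  # facts written before the mutation
    op: str                           # control-plane operation to invoke
    args: tuple[str, ...]             # arguments for that operation
    query: str                        # recall query issued afterwards
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


class ToyStore:
    """Deterministic in-memory store using plain substring matching."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def inscribe(self, fact: str) -> None:
        self.facts.append(fact)

    def recall(self, query: str) -> str:
        return "\n".join(f for f in self.facts if query in f)

    def purge(self, pattern: str) -> None:
        self.facts = [f for f in self.facts if pattern not in f]

    def supersede(self, old: str, new: str) -> None:
        # This toy store cannot supersede; report that instead of faking it.
        raise NotSupported("supersede")


def score_case(store: ToyStore, case: Case) -> Optional[bool]:
    """Return True/False for pass/fail, or None when the op is unsupported."""
    for fact in case.setup:
        store.inscribe(fact)
    try:
        getattr(store, case.op)(*case.args)
    except NotSupported:
        return None
    out = store.recall(case.query)
    kept = all(s in out for s in case.must_contain)
    forgotten = not any(s in out for s in case.must_not_contain)
    return kept and forgotten


def pass_rate(results: list[Optional[bool]]) -> Optional[float]:
    """Pass rate over scored cases only; N/A (None) is left out (assumption)."""
    scored = [r for r in results if r is not None]
    return sum(scored) / len(scored) if scored else None
```

In this sketch, a substring-based `purge("user_1")` also deletes a fact about `user_10`, which is the kind of prefix-collision failure a purely deterministic primitive cannot avoid without knowing the intent of the deletion.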