Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

Agent memory has two planes: a recall plane that retrieves stored facts (extensively benchmarked) and a control plane that mutates them via supersede, release, and purge (largely untested). Where an LLM sits in this pipeline shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical and temporal categories but fail at canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); an inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and improves nearly all categories simultaneously (91.7-93.2% overall, at $0.17 per 385-case run and 2.3 s/case mutation latency vs. 64-191 ms/case for deterministic primitives, with the recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted, oracle-validated cases) scored by deterministic substring match. It is paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines of code. Admission is corroborated by 10-annotator inter-annotator agreement (Fleiss' kappa = 0.958) and by a 77-case externally authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under the MIT license.
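As an illustration of what deterministic substring scoring with N/A handling could look like, the following is a minimal sketch and not the ForgetEval implementation. The `Case` shape, the `score_case` and `pass_rate` names, and the convention that an unsupported operation is signalled by `None` and excluded from the denominator are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Case:
    """Hypothetical case shape: substrings expected to survive or vanish."""
    must_contain: Sequence[str] = ()
    must_not_contain: Sequence[str] = ()


def score_case(case: Case, recalled: Optional[str]) -> Optional[bool]:
    """Score one case by plain substring match.

    Returns True (pass), False (fail), or None (N/A). In this sketch,
    `recalled is None` is the assumed signal that the adapter does not
    support the operation under test.
    """
    if recalled is None:
        return None
    kept = all(s in recalled for s in case.must_contain)
    gone = all(s not in recalled for s in case.must_not_contain)
    return kept and gone


def pass_rate(results: Sequence[Optional[bool]]) -> Optional[float]:
    """Pass rate over applicable cases only; N/A results are excluded
    from the denominator (an assumed reading of N/A scoring)."""
    applicable = [r for r in results if r is not None]
    if not applicable:
        return None
    return sum(applicable) / len(applicable)


if __name__ == "__main__":
    case = Case(must_contain=("Berlin",), must_not_contain=("Paris",))
    outcomes = [
        score_case(case, "User lives in Berlin"),
        score_case(case, "User lives in Paris, formerly Berlin"),
        score_case(case, None),  # adapter reported the operation as unsupported
    ]
    print(outcomes, pass_rate(outcomes))
```

Under these assumptions, a store that cannot perform an operation is neither rewarded nor penalized for it, and the scorer needs no model call.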
