Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

Agent memory pipelines have two planes: a recall plane that retrieves stored facts (extensively benchmarked) and a control plane that mutates them via supersede, release, and purge (largely untested). Where an LLM sits between these planes shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: (i) deterministic primitives suffice for lexical and temporal categories but fail on canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); (ii) an inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); (iii) a mutation-time hook recovers intent-aware deletion (78-85%) and improves nearly all categories at once (91.7-93.2% overall, at $0.17 per 385-case run and 2.3 s/case mutation latency versus 64-191 ms/case for the deterministic configurations, with the recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted cases and 253 LLM-drafted, oracle-validated cases) scored by deterministic substring match. It is paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines of code. Admission is corroborated by a 10-annotator inter-annotator agreement (IAA) study (Fleiss' kappa = 0.958) and by a 77-case externally authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
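As an illustration of what deterministic substring scoring with N/A handling might look like, the following is a minimal sketch under stated assumptions, not the ForgetEval implementation. The names `score_case` and `pass_rate`, the `forbidden` list, the `supported` flag, and the choice of exact (case-sensitive) matching are all hypothetical. The sketch assumes a case passes when no forgotten string remains in the recalled text, and that an operation a store does not support is recorded as N/A rather than as a failure.

```python
from typing import Optional, Sequence


def score_case(recalled: str, forbidden: Sequence[str],
               supported: bool = True) -> Optional[bool]:
    """Score one forgetting case (hypothetical helper).

    Returns True (pass) if none of the forbidden strings appears in the
    recalled text, False (fail) if any does, and None (N/A) when the
    store does not support the operation under test.
    """
    if not supported:
        return None  # N/A: excluded from the pass rate, not counted as a failure
    # Assumption: plain exact substring match with no normalization.
    return not any(s in recalled for s in forbidden)


def pass_rate(results: Sequence[Optional[bool]]) -> Optional[float]:
    """Fraction of passes among scored cases; None if every case is N/A."""
    scored = [r for r in results if r is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)


results = [
    score_case("user lives in Berlin", ["Paris"]),           # forgotten fact is gone
    score_case("user lives in Paris", ["Paris"]),            # forgotten fact leaked
    score_case("", ["Paris"], supported=False),              # operation unsupported
]
print(results, pass_rate(results))
```

Keeping N/A as a distinct value rather than folding it into failure means a store is only compared on the operations it actually implements, which is one plausible reading of "honest N/A scoring".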
