Large Language Model (LLM)-based agents employ external and internal memory systems to handle complex, goal-oriented tasks, yet these memories expose them to severe extraction attacks, and effective defenses remain lacking. In this paper, we propose MemPot, the first theoretically verified framework that defends against memory extraction attacks by injecting optimized honeypots into the agent's memory. Through a two-stage optimization process, MemPot generates trap documents that maximize the retrieval probability for attackers while remaining inconspicuous to benign users. We model the detection process as Wald's Sequential Probability Ratio Test (SPRT) and theoretically prove that MemPot requires fewer sampling rounds on average than the optimal static detector. Empirically, MemPot significantly outperforms state-of-the-art baselines, improving detection AUROC by 50% and True Positive Rate by 80% under low False Positive Rate constraints. Furthermore, our experiments confirm that MemPot incurs zero additional online inference latency and preserves the agent's utility on standard tasks, demonstrating its advantages in safety, harmlessness, and efficiency.
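To make the detection model concrete, the sequential test mentioned above can be sketched as a standard Wald SPRT over Bernoulli observations (e.g., "did this query retrieve a honeypot document?"). This is a generic illustration of the SPRT decision rule, not the paper's implementation: the hypothesized retrieval rates `p0` and `p1` and the error targets `alpha` and `beta` below are placeholder values.

```python
import math

def sprt(samples, p0, p1, alpha=0.05, beta=0.05):
    """Wald's Sequential Probability Ratio Test for Bernoulli observations.

    H0: honeypot-hit probability is p0 (benign user).
    H1: honeypot-hit probability is p1 (extraction attacker).
    Returns ("H0" | "H1" | "undecided", number of rounds consumed).
    """
    # Wald's approximate thresholds for target error rates alpha, beta.
    upper = math.log((1 - beta) / alpha)   # crossing above => accept H1
    lower = math.log(beta / (1 - alpha))   # crossing below => accept H0
    llr = 0.0
    for n, hit in enumerate(samples, start=1):
        # Accumulate the log-likelihood ratio for one observation.
        if hit:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", len(samples)

# A stream of repeated honeypot hits is flagged after only a few rounds,
# which is the sense in which a sequential test needs fewer samples than
# a fixed-sample-size detector with the same error guarantees.
verdict, rounds = sprt([1] * 20, p0=0.1, p1=0.9)
```

The key property MemPot's analysis leans on is visible here: the number of rounds is data-dependent, so clear-cut attackers (or clearly benign users) are classified early rather than after a fixed observation budget.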