Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, in which a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities, spanning model capabilities, system prompt design, and agent system architecture, that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. We also design MPBench, a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. Finally, we show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.