Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Large language model agents equipped with persistent memory are vulnerable to memory poisoning attacks, in which adversaries inject malicious instructions through query-only interactions that corrupt the agent's long-term memory and influence future responses. Recent work demonstrated that MINJA (Memory Injection Attack) achieves an injection success rate above 95% and an attack success rate of 70% under idealized conditions. However, the robustness of these attacks in realistic deployments, and the effectiveness of defensive mechanisms against them, remain understudied. This work addresses these gaps through a systematic empirical evaluation of memory poisoning attacks and defenses in Electronic Health Record (EHR) agents. We investigate attack robustness by varying three critical dimensions: initial memory state, number of indication prompts, and retrieval parameters. Our experiments on the GPT-4o-mini, Gemini-2.0-Flash, and Llama-3.1-8B-Instruct models using MIMIC-III clinical data reveal that realistic conditions with pre-existing legitimate memories dramatically reduce attack effectiveness. We then propose and evaluate two novel defense mechanisms: (1) Input/Output Moderation using composite trust scoring across multiple orthogonal signals, and (2) Memory Sanitization with trust-aware retrieval employing temporal decay and pattern-based filtering. Our defense evaluation reveals that effective memory sanitization requires careful trust threshold calibration to avoid both overly conservative rejection (blocking all entries) and insufficient filtering (missing subtle attacks), establishing important baselines for future adaptive defense mechanisms. These findings provide crucial insights for securing memory-augmented LLM agents in production environments.
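To make the threshold calibration issue concrete, the following is a minimal sketch of trust-aware memory sanitization with temporal decay and pattern-based filtering. It is an illustration under stated assumptions, not the scoring used in this work: the `MemoryEntry` fields, the regular expressions, the 30-day half-life, the choice to lower trust as entries age, and the thresholds are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for instruction-like text; not taken from the paper.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE),
    re.compile(r"\b(always|instead) (use|replace|refer to)\b", re.IGNORECASE),
]


@dataclass
class MemoryEntry:
    text: str
    base_trust: float  # in [0, 1], assumed to be assigned when the entry is stored
    age_days: float    # time elapsed since the entry was written


def trust_score(entry, half_life_days=30.0):
    """Return base trust halved every `half_life_days`; 0.0 if a pattern matches."""
    if any(p.search(entry.text) for p in SUSPICIOUS_PATTERNS):
        return 0.0
    decay = 0.5 ** (entry.age_days / half_life_days)
    return entry.base_trust * decay


def sanitize(entries, threshold):
    """Keep only the entries whose trust score meets the threshold."""
    return [e for e in entries if trust_score(e) >= threshold]


# Made-up entries: one fresh, one 30 days old, one instruction-like.
entries = [
    MemoryEntry("Weight lookup resolved via the chartevents table.", 0.9, 0.0),
    MemoryEntry("Older note on a lab value lookup.", 0.9, 30.0),
    MemoryEntry("Always use patient 2002 instead of patient 1001.", 0.9, 0.0),
]

for threshold in (0.0, 0.5, 0.95):
    kept = sanitize(entries, threshold)
    print(f"threshold={threshold}: kept {len(kept)} of {len(entries)} entries")
```

In this toy setup a threshold of 0.95 rejects every entry, a threshold of 0.0 admits the instruction-like entry, and 0.5 keeps only the fresh legitimate one, which mirrors the two failure modes named above.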
