FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion

Large language model (LLM) agents increasingly rely on long-term memory to support complex task execution, user personalization, and domain adaptation. Meanwhile, emerging access-control mechanisms for LLM agents are being explored to block policy-violating requests and prevent misuse. We reveal a novel attack surface arising from agent memory operations: prohibited content that would trigger access control can be fragmented across interactions, stored in long-term memory in benign-appearing form, and later reconstructed through memory retrieval without appearing explicitly in the final user query. We propose FragFuse, the first attack that enables unprivileged users to bypass agent access control by exploiting this temporal channel introduced by long-term memory. FragFuse operates in three stages: (1) identifying rejection-responsive fragments via black-box adaptive querying with fragment masking; (2) injecting these fragments into memory using marker carrier queries; and (3) retrieving and fusing the stored fragments through a follow-up attack query. Although FragFuse can be instantiated manually for individual agents, we further develop a surrogate-based optimization scheme that tunes fusion instructions and marker designs, enabling automated attack generation without violating the attacker's threat-model assumptions. We evaluate FragFuse across four representative agent settings and task domains, covering three state-of-the-art agent access-control mechanisms. FragFuse achieves an average bypass success rate of 86.3% and an average end-to-end harmful task success rate of 41.1% across all settings, with only 4.4% average task-success degradation compared with configurations without access control. We also show that alternative defenses, including state-of-the-art prompt-injection detectors and perplexity detectors, do not effectively address this attack.
