Indirect prompt injection threatens LLM agents by embedding malicious instructions in external content, enabling unauthorized actions and data theft. LLM agents maintain working memory through their context window, which stores interaction history for decision-making. Conventional agents indiscriminately accumulate all tool outputs and reasoning traces in this memory, creating two critical vulnerabilities: (1) injected instructions persist throughout the workflow, granting attackers multiple opportunities to manipulate behavior, and (2) verbose, non-essential content degrades decision-making capabilities. Existing defenses treat bloated memory as given and focus on remaining resilient, rather than reducing unnecessary accumulation to prevent the attack. We present AgentSys, a framework that defends against indirect prompt injection through explicit memory management. Inspired by process memory isolation in operating systems, AgentSys organizes agents hierarchically: a main agent spawns worker agents for tool calls, each running in an isolated context and able to spawn nested workers for subtasks. External data and subtask traces never enter the main agent's memory; only schema-validated return values can cross boundaries through deterministic JSON parsing. Ablations show isolation alone cuts attack success to 2.19%, and adding a validator/sanitizer further improves defense with event-triggered checks whose overhead scales with operations rather than context length. On AgentDojo and ASB, AgentSys achieves 0.78% and 4.25% attack success while slightly improving benign utility over undefended baselines. It remains robust to adaptive attackers and across multiple foundation models, showing that explicit memory management enables secure, dynamic LLM agent architectures. Our code is available at: https://github.com/ruoyaow/agentsys-memory.
翻译:间接提示注入通过将恶意指令嵌入外部内容来威胁LLM智能体,可能导致未授权操作和数据窃取。LLM智能体通过其上下文窗口维护工作记忆,该窗口存储交互历史以支持决策。传统智能体不加区分地将所有工具输出和推理轨迹累积于此内存中,从而产生两个关键漏洞:(1) 注入的指令在整个工作流程中持续存在,使攻击者获得多次操纵行为的机会;(2) 冗长非必要的内容会降低决策能力。现有防御方法将臃肿的内存视为既定事实,侧重于保持韧性,而非通过减少非必要累积来预防攻击。本文提出AgentSys框架,通过显式内存管理防御间接提示注入。受操作系统进程内存隔离机制启发,AgentSys采用分层方式组织智能体:主智能体为工具调用生成工作智能体,每个工作智能体在隔离上下文中运行,并可进一步为子任务生成嵌套工作智能体。外部数据与子任务轨迹永不进入主智能体内存;仅通过确定性JSON解析、经模式验证的返回值可跨越边界。消融实验表明,仅隔离机制即可将攻击成功率降至2.19%,而添加验证器/清理器配合事件触发检查能进一步提升防御效果,其开销随操作次数而非上下文长度增长。在AgentDojo与ASB基准测试中,AgentSys分别实现0.78%和4.25%的攻击成功率,同时在良性任务效用上较无防御基线略有提升。该系统对自适应攻击及多种基础模型均保持鲁棒性,证明显式内存管理能够实现安全动态的LLM智能体架构。代码已开源:https://github.com/ruoyaow/agentsys-memory。