AI agents, predominantly powered by large language models (LLMs), are vulnerable to indirect prompt injection, in which malicious instructions embedded in untrusted data can trigger dangerous agent actions. This position paper discusses our vision for system-level defenses against indirect prompt injection attacks. We articulate three positions: (1) dynamic replanning and security policy updates are often necessary for dynamic tasks and realistic environments; (2) certain context-dependent security decisions would still require LLMs (or other learned models), but should only be made within system designs that strictly constrain what the model can observe and decide; (3) in inherently ambiguous cases, personalization and human interaction should be treated as core design considerations. In addition to our main positions, we discuss limitations of existing benchmarks that can create a false sense of utility and security. We also highlight the value of system-level defenses, which serve as the skeleton of agentic systems by structuring and controlling agent behaviors, integrating rule-based and model-based security checks, and enabling more targeted research on model robustness and human interaction.