The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.
翻译:随着能够访问众多工具和敏感数据的LLM代理日益普及,间接提示注入的攻击面显著扩大。然而,由于攻击具有上下文依赖性,现有防御方案往往校准不佳,无法可靠区分恶意与良性指令,导致高误报率而难以实际部署。为此,我们提出一种受计算机安全基本原则启发的创新方法:数据不应包含可执行指令。我们摒弃样本级分类,提出一种词元级净化流程,能够精准移除工具输出中所有针对AI系统的指令,并以此捕获恶意指令。与现有安全分类器相比,该方法具备非阻塞特性,无需校准过程,且对工具输出的上下文保持不可知性。此外,我们仅需使用现成的指令调优数据即可训练此类词元级预测器,无需依赖挑战赛中不现实的提示注入示例或其他合成数据。实验表明,该方法在AgentDojo、BIPIA、InjecAgent、ASB和SEP等广泛攻击场景与基准测试中均表现出良好的泛化能力,在保持良性与恶意环境下代理功能不受损害的前提下,将攻击成功率(ASR)降低了7-10倍(在AgentDojo上从34%降至3%)。