Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, in which malicious instructions embedded in external data sources can manipulate agents to deviate from user intentions. While existing defenses based on rule constraints, source spotlighting, and authentication protocols show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test-time defense mechanism that systematically verifies whether each instruction and tool call contributes to user-specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces the attack success rate to 2.07\% while maintaining high task utility (69.79\%) on GPT-4o.
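To make the test-time alignment check concrete, the following is a minimal, hypothetical sketch rather than the paper's implementation: a gate that asks a judge model whether a proposed tool call serves the user's stated goal and blocks the call otherwise. All identifiers (is_aligned, guarded_call, judge, execute) are assumed for illustration only.

\begin{verbatim}
# Illustrative sketch of a test-time alignment gate (hypothetical names).
from typing import Callable

def is_aligned(user_goal: str, tool_name: str, tool_args: dict,
               judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the tool call contributes to the user's goal."""
    prompt = (
        "User goal: {goal}\n"
        "Proposed tool call: {name}({args})\n"
        "Does this call directly serve the user's goal? Answer YES or NO."
    ).format(goal=user_goal, name=tool_name, args=tool_args)
    return judge(prompt).strip().upper().startswith("YES")

def guarded_call(user_goal, tool_name, tool_args, judge, execute):
    """Execute the tool only if the alignment check passes; otherwise refuse."""
    if is_aligned(user_goal, tool_name, tool_args, judge):
        return execute(tool_name, tool_args)
    return f"[blocked] {tool_name} does not serve the user's task"

# Toy usage with a stub judge that flags an injected money transfer.
if __name__ == "__main__":
    stub_judge = lambda p: "NO" if "send_money" in p else "YES"
    print(guarded_call("Summarize my unread emails", "send_money",
                       {"to": "attacker", "amount": 100},
                       stub_judge, lambda n, a: f"executed {n}"))
\end{verbatim}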