Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce \textsc{ClawGuard}, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, \textsc{ClawGuard} blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that \textsc{ClawGuard} achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.
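To make the idea of enforcement at the tool-call boundary concrete, the following is a minimal Python sketch of a guard that checks every proposed tool call against a user-confirmed allowlist before the call executes. It is an illustration only, not the \textsc{ClawGuard} implementation: the names `Rule`, `Guard`, `is_allowed`, and `call`, the glob-based matching, and the example paths and addresses are all assumptions introduced here. The sketch also takes the rule set as given and does not model how constraints are derived from the user's stated objective.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import Any, Callable, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: permit `tool` when its target matches a glob pattern."""
    tool: str
    target_pattern: str


@dataclass
class Guard:
    """Hypothetical guard that mediates every tool call (not the paper's API)."""
    rules: List[Rule]
    audit_log: List[str] = field(default_factory=list)

    def is_allowed(self, tool: str, target: str) -> bool:
        # Deterministic check: the decision depends only on the rule set,
        # never on the model's reading of tool-returned content.
        return any(
            r.tool == tool and fnmatch(target, r.target_pattern)
            for r in self.rules
        )

    def call(self, tool: str, target: str, fn: Callable[[str], Any]) -> Any:
        # Intercept before any real-world effect: fn runs only if allowed.
        if not self.is_allowed(tool, target):
            self.audit_log.append(f"BLOCK {tool} {target}")
            raise PermissionError(f"{tool} on {target} is outside the rule set")
        self.audit_log.append(f"ALLOW {tool} {target}")
        return fn(target)


if __name__ == "__main__":
    # Assumed task: summarize files under /home/user/report/.
    guard = Guard([Rule("read_file", "/home/user/report/*")])
    print(guard.call("read_file", "/home/user/report/q1.txt", lambda t: f"read {t}"))
    try:
        # A call induced by injected instructions falls outside the rules.
        guard.call("send_email", "attacker@evil.example", lambda t: "sent")
    except PermissionError as exc:
        print("blocked:", exc)
    print(guard.audit_log)
```

In this sketch the decision to block is made by a fixed rule check rather than by the model, which is what makes the behavior deterministic and auditable: the log records every allowed and blocked call regardless of what the injected content said.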