STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.
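The closing point of the abstract, that a defense must reason over the whole action sequence rather than each call in isolation, can be illustrated with a toy sketch. This is not the paper's defense (which is a reasoning-driven prompt) and not its attack pipeline. The tool names `copy_file` and `delete_file`, the `ToolCall` type, and both check functions are hypothetical, chosen only to show how a per-call filter can pass every step of a chain whose cumulative effect is harmful (here, destroying every copy of a file).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str        # hypothetical tool name
    path: str        # file the call acts on
    dest: str = ""   # destination path, used only by copy_file


# Hypothetical allow-list: each tool looks routine when judged alone.
ALLOWED_TOOLS = {"copy_file", "delete_file"}


def per_call_check(call: ToolCall) -> bool:
    """Judge a single call in isolation (the approach the abstract says falls short)."""
    return call.tool in ALLOWED_TOOLS


def cumulative_check(initial_files: set[str], history: list[ToolCall]) -> bool:
    """Judge the whole history: True only if every original file still has a surviving copy."""
    # Map each live path to the original file whose content it holds.
    content = {p: p for p in initial_files}
    for call in history:
        if call.tool == "copy_file" and call.path in content:
            content[call.dest] = content[call.path]
        elif call.tool == "delete_file":
            content.pop(call.path, None)
    surviving = set(content.values())
    return all(p in surviving for p in initial_files)


# A three-step chain: back up, delete the original, then delete the backup.
chain = [
    ToolCall("copy_file", "report.db", "backup.db"),
    ToolCall("delete_file", "report.db"),
    ToolCall("delete_file", "backup.db"),
]

for i, call in enumerate(chain, start=1):
    isolated_ok = per_call_check(call)
    sequence_ok = cumulative_check({"report.db"}, chain[:i])
    print(f"step {i}: {call.tool:<12} isolated_ok={isolated_ok} sequence_ok={sequence_ok}")
```

In this sketch every step passes the isolated check, while the sequence-level check fails only at the final step, mirroring the abstract's observation that the harm becomes apparent only at the last execution step. A real agent defense would need a far richer model of environment state than this single hand-written rule.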
