Tool-using LLM agents act on the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text. Yet most existing attacks and defenses, including ``multi-turn'' jailbreaks such as Crescendo and Tree of Attacks, still assume a single contiguous conversation visible to the defender. This assumption breaks down in real agent pipelines, where enforcement is fragmented across tools, modules, and time, and where artifact provenance is often not tracked. We operationalize this deployment failure mode, the \emph{provenance gap}, and study a reproducible family of triggers for it: \emph{Context-Fractured Decomposition} (CFD). CFD attacks are cross-context, multi-step jailbreaks that preserve benign-looking intermediate artifacts from an early interaction and elicit harmful behavior much later, potentially in a different agent instance or workflow stage. Each tool action is individually innocuous; the risk emerges only under delayed, artifact-mediated composition. We instrument the failure mode with trace-level diagnostics and outline a verifiable mitigation direction (provenance lineage tagging). Across agent-system jailbreak benchmarks, CFD improves attack success rates by up to 28.3 percentage points over state-of-the-art baselines, even against strong single-turn judges. Disclaimer: This paper contains examples of harmful or offensive language.