LLM-based agents are becoming increasingly capable and widely deployed, which creates growing incentives for adversarial misuse in the real world. A key emerging threat is the decomposition attack \cite{glukhov2024breach, jones2024adversaries}, in which a harmful task is broken into simpler, benign-looking subtasks that evade safety mechanisms when executed separately but cumulatively fulfill the malicious intent. Although recent benchmarks assess agent safety in multi-turn and multi-tool settings, they do not explicitly capture this form of decompositional misuse and may not represent realistic adversarial execution flows. To address this gap, we introduce DeCompBench, a benchmark designed specifically to evaluate agentic safety under decomposition attacks. DeCompBench follows a decomposition-by-design principle: it is built on a graphical framework that allows each harmful task to be decomposed into individually benign, executable subtasks with realistic workflows. Our experiments with a custom decomposer show that state-of-the-art agents exhibit high refusal rates on monolithic harmful tasks but significantly lower refusal rates on the decomposed variants, and that they often inadvertently fulfill the adversarial objective. These findings underscore the need for safety evaluations against decomposition attacks and for corresponding defenses. Our dataset is publicly available at https://huggingface.co/datasets/decompositionbench/DeCompBench.
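The comparison between monolithic and decomposed refusal rates can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Episode` record, the `refusal_rate` helper, the rule that a decomposed episode counts as refused if the agent refuses any one subtask, and the toy refusal flags, which are placeholders rather than results from DeCompBench.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical record of one agent run on one task (not the benchmark's schema)."""
    task_id: str
    # One flag per prompt sent to the agent: True if the agent refused it.
    # A monolithic task has a single prompt; a decomposed task has one per subtask.
    refusals: list[bool] = field(default_factory=list)


def refusal_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes in which the agent refused at least one prompt.

    Assumption: refusing any subtask is treated as refusing the whole task.
    """
    if not episodes:
        return 0.0
    refused = sum(1 for e in episodes if any(e.refusals))
    return refused / len(episodes)


# Toy placeholder data: the same four tasks, posed monolithically and decomposed.
monolithic = [
    Episode("t1", [True]),
    Episode("t2", [True]),
    Episode("t3", [False]),
    Episode("t4", [True]),
]
decomposed = [
    Episode("t1", [False, False, False]),
    Episode("t2", [False, True]),
    Episode("t3", [False, False]),
    Episode("t4", [False, False, False]),
]

print(f"monolithic refusal rate: {refusal_rate(monolithic):.2f}")
print(f"decomposed refusal rate: {refusal_rate(decomposed):.2f}")
```

A gap between the two numbers is the kind of signal the abstract describes. Measuring whether the adversarial objective was actually fulfilled would need a separate check on the combined subtask outputs, which this sketch does not attempt.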