LLM-based guardrails have emerged as a highly effective defense against prompt injection and jailbreak attacks in autonomous agents. However, we show that the very reasoning and task-following capabilities that enable this protection introduce a novel vulnerability: attackers can inject crafted data that traps the guardrail in extended reasoning loops, mounting a systematic denial-of-service (DoS) attack. To expose this threat, we design a beam-search optimization framework that crafts natural-language payloads to maximize guardrail reasoning length, using an LLM proposer guided by a strategy bank. Observing that guardrails tend to follow the schemas they are given, we also present a second attack framework, driven by mechanism-aware structural mutations, that requires less computation. We evaluate attack efficacy in two parts. First, in standalone evaluations, the attack generalizes across diverse guardrail architectures, safety templates, and agent benchmarks. Payloads optimized on a single open-source surrogate transfer to eight leading model backbones (e.g., Claude, GPT, Gemini, DeepSeek, and Qwen), achieving 13--63$\times$ token amplification. Second, in end-to-end real-world agent deployments (web, desktop, code, and multi-agent systems), the attack causes up to 148$\times$ latency amplification. We show that a single poisoned document can saturate shared guardrail infrastructure, starving co-located agents and paralyzing the entire system. By uncovering this availability flaw, our work underscores the urgent need for cost-bounded, reasoning-robust guardrails.