Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers.

Large Language Models (LLMs) are widely deployed as tooling platforms through structured output APIs, which enforce syntax compliance so that model outputs integrate robustly with existing software such as agent systems. However, the very mechanism that enables grammar-guided structured output introduces significant security vulnerabilities. In this work, we reveal a critical control-plane attack surface that is orthogonal to traditional data-plane vulnerabilities. We introduce the Constrained Decoding Attack (CDA), a novel jailbreak class that weaponizes structured output constraints to bypass both external auditing and internal safety alignment. Unlike prior attacks that focus on crafting input prompts, CDA embeds malicious intent in schema-level grammar rules (the control plane) while keeping the surface prompt benign (the data plane). We instantiate CDA with two proof-of-concept attacks: EnumAttack, which embeds malicious content in enum fields, and the more evasive DictAttack, which decouples the malicious payload across a benign prompt and a dictionary-based grammar. Our evaluation spans 13 proprietary and open-weight models. In particular, DictAttack achieves a 94.3--99.5% attack success rate (ASR) across five benchmarks on gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b. Furthermore, we demonstrate how difficult these threats are to defend against: while basic grammar auditing mitigates EnumAttack, the more sophisticated DictAttack maintains a 75.8% ASR even against multiple state-of-the-art jailbreak guardrails. This exposes a critical "semantic gap" in current safety architectures and underscores the urgent need for cross-plane defenses that bridge the data and control planes to secure the LLM generation pipeline.
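To make the data-plane/control-plane distinction concrete, the sketch below shows how a typical structured output API call separates the two channels: the prompt travels in `messages` (data plane), while the JSON Schema passed via `response_format` constrains the literal strings the decoder may emit (control plane). This is a minimal, deliberately benign illustration assuming an OpenAI-compatible structured output endpoint; the model name, schema, and field names are placeholders of ours, not the paper's attack payloads.

```python
# Minimal, deliberately benign sketch of a structured-output call, assuming an
# OpenAI-compatible endpoint. It only illustrates the two channels discussed
# above: the data plane (the prompt in `messages`) and the control plane
# (the grammar in `response_format`). No attack payload is shown.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Control plane: a JSON Schema whose `enum` restricts the exact strings the
# constrained decoder may produce for the "color" field.
schema = {
    "type": "object",
    "properties": {
        "color": {"type": "string", "enum": ["red", "green", "blue"]},
    },
    "required": ["color"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    # Data plane: a benign surface prompt.
    messages=[{"role": "user", "content": "Pick a color for the dashboard."}],
    # Control plane: grammar-guided constrained decoding.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "color_choice", "schema": schema, "strict": True},
    },
)

print(response.choices[0].message.content)  # e.g. {"color": "blue"}
```

The point of the sketch is only that the schema, not the prompt, dictates the tokens the model is permitted to emit; this schema channel is the surface that CDA exploits while the prompt remains benign.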