The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen,Jiashu Qu,Zhaorun Chen,Xiaoya Lu,Dongrui Liu,Zhiyuan Liu,Ruixi Wu,Yicun Yang,Xiangqi Jin,Haoyun Xu,Xuyang Liu,Weijia Li,Chaochao Lu,Jing Shao,Conghui He,Linfeng Zhang

from arxiv, Accepted by ICLR 2026

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

翻译：基于扩散的大语言模型（dLLMs）近期作为自回归大语言模型的有力替代方案崭露头角，其通过并行解码和双向建模提供了更快的推理速度和更强的交互性。然而，尽管在代码生成和文本填充任务中表现出色，我们发现了一个根本性的安全问题：现有的对齐机制无法保护dLLMs免受上下文感知的掩码输入对抗性提示的攻击，从而暴露出新的安全漏洞。为此，我们提出了DIJA——首个针对dLLMs独特安全弱点的系统性研究与越狱攻击框架。具体而言，我们提出的DIJA构建了对抗性的交错掩码-文本提示，以利用dLLMs的文本生成机制，即双向建模与并行解码。双向建模会驱动模型为掩码片段生成上下文一致的输出（即使内容有害），而并行解码则限制了模型对不安全内容的动态过滤和拒绝采样能力。这导致标准对齐机制失效，使得经过对齐调优的dLLMs即使在提示中直接暴露有害行为或不安全指令的情况下，仍可能生成有害的补全内容。通过全面的实验，我们证明DIJA在性能上显著优于现有的越狱方法，揭示了dLLM架构中一个先前被忽视的威胁面。值得注意的是，我们的方法在Dream-Instruct数据集上实现了高达100%的关键词攻击成功率，在JailbreakBench上以评估器为基准的攻击成功率比先前最强的基线方法ReNeLLM高出78.5%，在StrongREJECT分数上领先37.7个百分点，且无需在越狱提示中重写或隐藏有害内容。我们的研究结果强调了在这一新兴语言模型类别中重新思考安全对齐机制的迫切性。代码发布于https://github.com/ZichenWen1/DIJA。