Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond these utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Building on this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this black-box strategy bypasses D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Notably, to our knowledge, it enables the first successful jailbreak of Gemini Diffusion, exposing a critical vulnerability in proprietary D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing and constitute an early-stage red-teaming of D-LLMs.