Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches target autoregressive models and typically rely on post-hoc filtering or inference-time interventions, which do not adequately address safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework for text diffusion models. SAD modifies the iterative denoising process so that the text sample at the final denoising step is steered toward provably safe regions of the text space. Because it operates at inference time, the method integrates safety constraints into the denoiser without computationally expensive retraining of the underlying diffusion model, enabling flexible, lightweight safety guidance. We evaluate the safety of text generated with SAD along three axes: hazard taxonomy, memorization, and jailbreak attacks. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.
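To make the idea of steering the denoising trajectory concrete, the following is a minimal toy sketch, not the SAD algorithm itself, whose details the abstract does not give. It assumes a continuous 2-D "embedding" in place of text, a safe region modeled as a half-space {z : w·z <= b}, and guidance implemented by projecting the denoiser's clean-sample estimate onto that region at every step. All names (`toy_denoiser`, `project_to_halfspace`, `guided_sampling`) are hypothetical and chosen only for illustration.

```python
def project_to_halfspace(x, w, b):
    """Euclidean projection of x onto the half-space {z : w.z <= b}."""
    violation = sum(wi * xi for wi, xi in zip(w, x)) - b
    if violation <= 0:
        return list(x)  # already inside the (assumed) safe region
    norm_sq = sum(wi * wi for wi in w)
    return [xi - violation * wi / norm_sq for wi, xi in zip(w, x)]


def toy_denoiser(x, target):
    """Stand-in for a frozen, pretrained denoiser (assumption):
    it always predicts the same clean sample, `target`."""
    return list(target)


def guided_sampling(x_init, target, w, b, num_steps=10, guide=True):
    """Iteratively denoise x_init; optionally steer each clean-sample
    estimate into the safe half-space before taking the step."""
    x = list(x_init)
    for step in range(num_steps):
        x0_hat = toy_denoiser(x, target)
        if guide:
            # Inference-time guidance: the denoiser is untouched, only
            # its prediction is corrected toward the safe region.
            x0_hat = project_to_halfspace(x0_hat, w, b)
        # Interpolation weight grows to 1 at the final step, so the
        # last iterate equals the (possibly projected) estimate.
        alpha = 1.0 / (num_steps - step)
        x = [(1 - alpha) * xi + alpha * x0i for xi, x0i in zip(x, x0_hat)]
    return x


# The unguided denoiser is drawn to an "unsafe" target (first coord > 1).
w, b = [1.0, 0.0], 1.0
unguided = guided_sampling([-3.0, 4.0], [2.0, 0.0], w, b, guide=False)
guided = guided_sampling([-3.0, 4.0], [2.0, 0.0], w, b, guide=True)
```

In this toy setup the unguided run ends at the unsafe target, while the guided run ends on the boundary of the safe half-space, with no change to the denoiser's parameters. A real text diffusion model would need a safety constraint defined over token sequences or their representations, which this sketch does not attempt to model.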