Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) they cannot remove harmful concepts instantly, without training; (2) their safe-generation capability depends on the collected training data; and (3) they alter model weights, risking quality degradation on content unrelated to the toxic concepts. To address these issues, we propose SAFREE, a novel, training-free approach for safe text-to-image (T2I) and text-to-video (T2V) generation that does not alter the model's weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace, filtering out harmful content while preserving the intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the denoising steps when applying the filtered embeddings. Additionally, we incorporate an adaptive re-attention mechanism within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. As a result, SAFREE ensures coherent safety checking while preserving the fidelity, quality, and safety of the output. SAFREE achieves state-of-the-art performance in suppressing unsafe content in T2I generation compared to training-free baselines, effectively filters targeted concepts while maintaining high-quality images, and shows competitive results against training-based methods. We further extend SAFREE to various T2I backbones and to T2V tasks, showcasing its flexibility and generalization. SAFREE thus provides a robust and adaptable safeguard for safe visual generation.
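The core idea of steering prompt embeddings away from a toxic-concept subspace can be sketched as a projection onto the subspace's orthogonal complement. The sketch below is illustrative only, not the paper's implementation: the function name, dimensions, and the SVD-based basis construction are our assumptions, and SAFREE's actual filtering additionally involves the self-validating and re-attention mechanisms described above.

```python
# Hypothetical sketch (not SAFREE's released code): given text embeddings for a
# set of toxic concepts, build the subspace they span via SVD and remove the
# prompt embedding's component inside that subspace.
import numpy as np

def remove_toxic_subspace(prompt_emb, toxic_embs, rank=None):
    """Project `prompt_emb` (shape (d,)) onto the orthogonal complement of the
    subspace spanned by the rows of `toxic_embs` (shape (k, d))."""
    # Thin SVD: rows of `vt` form an orthonormal basis of the row space.
    _, s, vt = np.linalg.svd(toxic_embs, full_matrices=False)
    r = rank if rank is not None else int(np.sum(s > 1e-8))
    basis = vt[:r]                                  # (r, d), orthonormal rows
    # Subtract the component lying inside the toxic subspace.
    return prompt_emb - basis.T @ (basis @ prompt_emb)

# Toy usage with random 16-dimensional stand-ins for text embeddings.
rng = np.random.default_rng(0)
toxic = rng.normal(size=(3, 16))    # 3 toxic-concept embeddings
prompt = rng.normal(size=16)        # prompt embedding to filter
safe = remove_toxic_subspace(prompt, toxic)
```

After projection, `safe` is orthogonal to every toxic-concept direction (`toxic @ safe` is numerically zero), while the component of the prompt outside that subspace, i.e. the intended semantics, is left untouched.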