Large-scale image generation models owe their impressive quality to the vast amount of data available on the Internet, but they also raise social concerns because they may generate harmful or copyrighted content. Biases and harmful behavior arise throughout the entire training process and are hard to remove completely, which has become a significant hurdle to the safe deployment of these models. In this paper, we propose a method called SDD to prevent problematic content generation in text-to-image diffusion models. We self-distill the diffusion model so that the noise estimate conditioned on the concept targeted for removal is guided to match the unconditional one. Compared to previous methods, our method eliminates a much greater proportion of harmful content from the generated images without degrading overall image quality. Furthermore, our method allows multiple concepts to be removed at once, whereas previous works are limited to removing a single concept at a time.
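To make the self-distillation objective concrete, the following is a minimal sketch of the idea, not the SDD implementation. It assumes a toy linear noise predictor in place of a diffusion U-Net, a frozen copy of the model acting as the teacher, a zero vector as the unconditional embedding, and a single fixed noisy sample. The names `noise_estimate`, `distill_step`, `concept_emb`, and `uncond_emb` are hypothetical and chosen only for illustration.

```python
import numpy as np

def noise_estimate(weights, x_t, cond):
    """Toy linear stand-in for a noise predictor eps(x_t, cond) (assumption)."""
    return weights["x"] @ x_t + weights["c"] @ cond

def distill_step(student, teacher, x_t, concept_emb, uncond_emb, lr=0.1):
    """Take one gradient step on the student; return the loss before the update."""
    # Student estimate conditioned on the concept to be removed.
    eps_concept = noise_estimate(student, x_t, concept_emb)
    # Teacher's unconditional estimate, treated as a fixed target.
    eps_uncond = noise_estimate(teacher, x_t, uncond_emb)
    residual = eps_concept - eps_uncond
    loss = float(np.mean(residual ** 2))
    # Analytic gradient of the mean squared error for this linear model.
    scale = 2.0 / residual.size
    student["x"] -= lr * scale * np.outer(residual, x_t)
    student["c"] -= lr * scale * np.outer(residual, concept_emb)
    return loss

rng = np.random.default_rng(0)
dim, cond_dim = 4, 3
teacher = {"x": rng.normal(size=(dim, dim)),
           "c": rng.normal(size=(dim, cond_dim))}
# Self-distillation: the student starts as a copy of the teacher.
student = {name: w.copy() for name, w in teacher.items()}

x_t = rng.normal(size=dim)               # one fixed noisy sample (assumption)
concept_emb = rng.normal(size=cond_dim)  # embedding of the concept to remove
uncond_emb = np.zeros(cond_dim)          # stand-in for the empty prompt

losses = [distill_step(student, teacher, x_t, concept_emb, uncond_emb)
          for _ in range(20)]
```

In this toy setting the loss shrinks as the student's concept-conditioned estimate approaches the teacher's unconditional one, so the concept stops steering the prediction. A practical version would sample many noisy inputs and timesteps rather than reuse a single `x_t`.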