Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for such undesired content, viewed from the perspective of the diffusion model's internal representation, remain unclear. Prior work interprets vectors in an interpretable latent space of diffusion models as semantic concepts, but existing approaches cannot discover directions for arbitrary concepts, including inappropriate ones. In this work, we propose a novel self-supervised approach that finds interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments verify the effectiveness of our mitigation approach on fair generation, safe generation, and responsible text-enhancing generation. Project page: \url{https://interpretdiffusion.github.io}.
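For intuition, a minimal PyTorch-style sketch of the general recipe the abstract describes: learn a single direction in an intermediate U-Net feature space under the standard self-supervised denoising objective on images depicting the target concept, then reuse that direction at sampling time to steer generation. Everything here is an illustrative assumption rather than the paper's exact implementation: the names \texttt{unet}, \texttt{noise\_scheduler}, \texttt{neutral\_text\_emb}, and \texttt{concept\_batches}, the hook location (\texttt{mid\_block}), and the feature dimension (1280) are all placeholders.

\begin{verbatim}
# Illustrative sketch only; hook point, dimensions, and helper
# objects (unet, noise_scheduler, neutral_text_emb, concept_batches)
# are assumptions, not the paper's exact implementation.
import torch

# Learnable direction in the U-Net's intermediate feature space.
concept_vec = torch.zeros(1, 1280, requires_grad=True)  # dim assumed
optimizer = torch.optim.Adam([concept_vec], lr=1e-3)

def inject(module, args, output):
    # Add the concept direction to the bottleneck features,
    # broadcasting over spatial positions.
    return output + concept_vec[..., None, None]

# Hook point assumed: the U-Net mid block (frozen model weights).
handle = unet.mid_block.register_forward_hook(inject)

for x0, t, noise in concept_batches:  # images of the target concept
    x_t = noise_scheduler.add_noise(x0, noise, t)
    pred = unet(x_t, t, encoder_hidden_states=neutral_text_emb).sample
    # Self-supervised objective: only the learned direction can
    # explain the concept, since the prompt is kept neutral.
    loss = torch.nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

handle.remove()
# At sampling time, subtracting concept_vec at the same hook steers
# generation away from the concept (mitigation); adding it steers
# generation toward it.
\end{verbatim}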