Motivated by recent advancements in text-to-image diffusion, we study the erasure of specific concepts from the model's weights. While Stable Diffusion can produce explicit or realistic artwork, this capability has raised concerns about potential misuse. We propose a fine-tuning method that can erase a visual concept from a pre-trained diffusion model, given only the name of the style and using negative guidance as a teacher. We benchmark our method against previous approaches that remove sexually explicit content and demonstrate its effectiveness: it performs on par with Safe Latent Diffusion and censored training. To evaluate artistic style removal, we erase five modern artists from the network and conduct a user study to assess human perception of the removed styles. Unlike previous methods, our approach removes concepts from a diffusion model permanently rather than modifying the output at inference time, so it cannot be circumvented even if a user has access to the model weights. Our code, data, and results are available at https://erasing.baulab.info/
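The abstract does not spell out the training objective, so the following is only a minimal sketch of one plausible reading of "negative guidance as a teacher". It assumes a frozen copy of the model supplies an unconditional noise prediction and a concept-conditioned one, and that the fine-tuned model is trained to match the unconditional prediction pushed away from the concept direction. The function names, the guidance scale `eta`, and the use of plain Python lists in place of tensors are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def negative_guidance_target(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    eta: float = 1.0,
) -> List[float]:
    """Teacher target (assumed form): the frozen model's unconditional
    prediction, moved AWAY from the concept-conditioned prediction by eta."""
    return [u - eta * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def erasure_loss(
    eps_student_cond: Sequence[float],
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    eta: float = 1.0,
) -> float:
    """Mean squared error between the fine-tuned model's concept-conditioned
    prediction and the negatively guided teacher target."""
    target = negative_guidance_target(eps_uncond, eps_cond, eta)
    return sum((s - t) ** 2 for s, t in zip(eps_student_cond, target)) / len(target)


# Toy values standing in for noise predictions from a frozen model.
frozen_uncond = [0.0, 1.0]
frozen_cond = [1.0, 3.0]

target = negative_guidance_target(frozen_uncond, frozen_cond, eta=1.0)
loss_before = erasure_loss(frozen_cond, frozen_uncond, frozen_cond, eta=1.0)
loss_after = erasure_loss(target, frozen_uncond, frozen_cond, eta=1.0)
```

In this toy setup the loss is large while the student still reproduces the concept-conditioned prediction and drops to zero once it matches the teacher target. Because the update is applied to the weights themselves, nothing needs to be added at inference time, which is the property the abstract contrasts with output-filtering approaches.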