Motivated by recent advances in text-to-image diffusion, we study the erasure of specific concepts from a diffusion model's weights. While Stable Diffusion has shown promise in producing explicit or realistic artwork, it has raised concerns regarding its potential for misuse. We propose a fine-tuning method that can erase a visual concept from a pre-trained diffusion model, given only the name of the style and using negative guidance as a teacher. We benchmark our method against previous approaches that remove sexually explicit content and demonstrate its effectiveness, performing on par with Safe Latent Diffusion and censored training. To evaluate artistic style removal, we erase five modern artists from the network and conduct a user study to assess human perception of the removed styles. Unlike previous methods, our approach removes concepts from a diffusion model permanently rather than modifying the output at inference time, so it cannot be circumvented even if a user has access to the model weights. Our code, data, and results are available at https://erasing.baulab.info/
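The abstract does not spell out the training objective, so the following is only a minimal sketch of one way to read "negative guidance as a teacher". It assumes a frozen copy of the model supplies an unconditional noise prediction and a concept-conditioned one, and that the fine-tuned model is regressed toward a target pushed away from the concept direction. The function names, the scale `eta`, and the use of plain NumPy arrays in place of real network outputs are all illustrative assumptions, not the released implementation.

```python
import numpy as np


def negative_guidance_target(eps_uncond, eps_cond, eta=1.0):
    """Hypothetical teacher target: start from the frozen model's
    unconditional prediction and step AWAY from the concept direction
    (eps_cond - eps_uncond), scaled by eta."""
    return eps_uncond - eta * (eps_cond - eps_uncond)


def erasure_loss(eps_student_cond, eps_uncond, eps_cond, eta=1.0):
    """Mean squared error between the fine-tuned model's
    concept-conditioned prediction and the negatively guided target."""
    target = negative_guidance_target(eps_uncond, eps_cond, eta)
    return float(np.mean((eps_student_cond - target) ** 2))


# Toy stand-ins for noise predictions (real ones would be image-shaped tensors).
eps_uncond = np.array([0.0, 0.0])   # frozen model, empty prompt
eps_cond = np.array([1.0, 2.0])     # frozen model, prompt naming the concept

target = negative_guidance_target(eps_uncond, eps_cond, eta=1.0)
loss_unchanged = erasure_loss(eps_cond, eps_uncond, eps_cond, eta=1.0)
loss_matched = erasure_loss(target, eps_uncond, eps_cond, eta=1.0)
```

In this toy setup the loss is large while the student still reproduces the concept-conditioned prediction and drops to zero once it matches the negatively guided target. That mirrors the intent of changing the weights themselves instead of steering the output at inference time.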