While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning or rely on various forms of token remapping, rendering them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, wherein we first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning. Experiments across a range of concepts including artistic styles, nudity, object erasure, and gender debiasing demonstrate that target concepts can be efficiently erased by pruning a tiny fraction (approximately 0.12%) of total weights, while also enabling multi-concept erasure and robustness against various white-box and black-box adversarial attacks.
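To make the identify-then-prune idea concrete, the following is a minimal sketch of one plausible instantiation: contrast feed-forward neuron activations in a diffusion UNet between a concept prompt and a neutral prompt, then zero out the weight columns of the most concept-selective neurons. It assumes Stable Diffusion v1.5 via the `diffusers` library; the prompts, scoring rule, and prune fraction are illustrative choices, not the paper's exact procedure.

```python
# Hedged sketch: activation-contrast neuron selection + weight-column pruning
# for concept erasure in a diffusion UNet. Assumes diffusers + SD v1.5;
# prompts and prune_frac below are hypothetical, not the paper's settings.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hook the input of every FFN output projection (ff.net.2) in the UNet;
# that input is exactly the FFN's hidden-neuron activations.
acts = {}    # layer name -> accumulated mean |activation| per neuron
hooks = []

def make_hook(name):
    def hook(module, inputs, output):
        a = inputs[0].detach().float().abs().mean(dim=(0, 1))  # [hidden_dim]
        acts[name] = acts.get(name, 0) + a
    return hook

for name, mod in pipe.unet.named_modules():
    if name.endswith("ff.net.2") and isinstance(mod, torch.nn.Linear):
        hooks.append(mod.register_forward_hook(make_hook(name)))

def collect(prompts, steps=10):
    """Run the pipeline and return per-layer mean neuron activations."""
    acts.clear()
    with torch.no_grad():
        for p in prompts:
            pipe(p, num_inference_steps=steps, output_type="latent")
    return {k: v.clone() for k, v in acts.items()}

concept_acts = collect(["a painting in the style of Van Gogh"])  # target
neutral_acts = collect(["a painting"])                           # reference

for h in hooks:
    h.remove()

# Prune: zero the weight columns of neurons that respond much more to the
# concept prompt than to the neutral one (fraction chosen arbitrarily here).
prune_frac = 0.01
for name, mod in pipe.unet.named_modules():
    if name in concept_acts:
        score = concept_acts[name] - neutral_acts[name]
        k = max(1, int(prune_frac * score.numel()))
        idx = torch.topk(score, k).indices
        with torch.no_grad():
            mod.weight[:, idx] = 0  # remove those neurons' contribution
```

Zeroing column `j` of an `nn.Linear` weight removes hidden neuron `j`'s contribution entirely (since `output = input @ weight.T + bias`), so this edit is training-free and touches only a small fraction of parameters, in the spirit of the approach described above.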