Recent text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethics. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most rely on data-intensive and computationally expensive fine-tuning, which is a critical limitation. Motivated by the observation that a model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose ActErase, a novel training-free method for efficient concept erasure. Specifically, the method identifies activation difference regions via prompt-pair analysis, extracts the target activations, and dynamically replaces input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
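To make the three steps named above concrete (difference-region identification from a prompt pair, target-activation extraction, and replacement at inference), the following is a minimal toy sketch on flat activation vectors. It is not the ActErase implementation: the function names, the top-fraction selection rule, the cosine-similarity gate, its threshold, and the choice to substitute the neutral prompt's activations are all illustrative assumptions, since the abstract does not specify these details.

```python
from math import sqrt

def difference_region(target_act, neutral_act, top_fraction=0.1):
    """Return the indices where a prompt pair's activations differ most.

    Assumption: the region is the top `top_fraction` of units ranked by
    absolute activation difference between the target and neutral prompts.
    """
    diffs = [abs(t - n) for t, n in zip(target_act, neutral_act)]
    k = max(1, int(len(diffs) * top_fraction))
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:k])

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def erase(input_act, target_act, neutral_act, region, threshold=0.9):
    """Replace the region with neutral values if it resembles the target.

    Assumption: replacement is gated by cosine similarity between the input
    and the extracted target activations, restricted to the region.
    """
    in_region = [input_act[i] for i in region]
    tgt_region = [target_act[i] for i in region]
    if cosine(in_region, tgt_region) < threshold:
        return list(input_act)  # input does not carry the concept: untouched
    out = list(input_act)
    for i in region:
        out[i] = neutral_act[i]
    return out

# Hypothetical activations for a prompt pair (with and without the concept).
target = [0.1, 0.2, 5.0, 0.3, -4.0, 0.1, 0.0, 0.2, 0.1, 0.3]
neutral = [0.1, 0.2, 0.5, 0.3, 0.4, 0.1, 0.0, 0.2, 0.1, 0.3]
region = difference_region(target, neutral, top_fraction=0.2)

# An input that carries the concept, and a benign input that does not.
concept_input = [0.0, 0.3, 4.8, 0.2, -3.9, 0.1, 0.1, 0.2, 0.0, 0.3]
benign_input = [0.0, 0.3, 0.4, 0.2, 0.5, 0.1, 0.1, 0.2, 0.0, 0.3]

erased = erase(concept_input, target, neutral, region)
untouched = erase(benign_input, target, neutral, region)
```

In this toy setting only the two units that separate the prompt pair are modified, and only for the input that resembles the target, which mirrors the stated intuition that generic activations should pass through unchanged. In a real diffusion model the same logic would presumably be applied to intermediate layer activations during the forward pass rather than to hand-written lists.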