Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose ActErase, a novel training-free method for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting the target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
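The prompt-pair analysis and activation replacement described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`find_diff_mask`, `erase_activations`), the top-fraction threshold, and the use of a plain array in place of a real diffusion model's feature maps are all assumptions made for clarity.

```python
import numpy as np

def find_diff_mask(act_target, act_anchor, top_frac=0.05):
    """Hypothetical prompt-pair analysis: mark the small fraction of
    activation positions that differ most between a target-concept
    prompt and an anchor (concept-free) prompt."""
    diff = np.abs(act_target - act_anchor)
    k = max(1, int(top_frac * diff.size))
    # threshold at the k-th largest absolute difference
    thresh = np.partition(diff.ravel(), -k)[-k]
    return diff >= thresh

def erase_activations(act_in, act_anchor, mask):
    """Hypothetical dynamic replacement step: during a forward pass,
    overwrite activations inside the difference region with the
    anchor prompt's activations, leaving the rest untouched."""
    out = act_in.copy()
    out[mask] = act_anchor[mask]
    return out

# Toy demonstration on random "activations" standing in for a
# diffusion model's intermediate features.
rng = np.random.default_rng(0)
act_target = rng.normal(size=(4, 8))
act_anchor = rng.normal(size=(4, 8))

mask = find_diff_mask(act_target, act_anchor, top_frac=0.1)
erased = erase_activations(act_target, act_anchor, mask)
```

In a real model, this replacement would be registered as a forward hook on the selected layers, so the edit is applied on the fly at inference time without any fine-tuning, which is what makes the approach plug-and-play.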