Concept erasure helps prevent diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. Robustness means that a model fine-tuned by a concept-erasure method resists reactivation of erased concepts, even under semantically related prompts. Retention means that unrelated concepts are preserved, so the model's overall utility stays intact. Both are critical for concept erasure in practice, yet prior work typically strengthens one while degrading the other: mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient-Informed Synergy (AEGIS), a retention-data-free framework that advances both robustness and retention.
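The fixed-target mapping criticized above can be sketched as the following fine-tuning objective; the notation is illustrative rather than the paper's own, with $\epsilon_{\theta}$ the fine-tuned noise predictor, $\epsilon_{\theta^{*}}$ the frozen original model, $c_e$ the single erased prompt, and $c_0$ a fixed safe target prompt:

```latex
\min_{\theta}\; \mathbb{E}_{x_t,\, t}\,
  \bigl\| \epsilon_{\theta}(x_t, c_e, t) - \epsilon_{\theta^{*}}(x_t, c_0, t) \bigr\|_2^2
```

Because only the single conditioning $c_e$ is redirected toward $c_0$, semantically related prompts $c \approx c_e$ are left largely untouched, which is the class-level remnant that prompt attacks can exploit.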