Text-to-image diffusion models (DMs) are frequently abused to produce harmful or copyrighted content, violating public interests. Concept erasure (unlearning) is a promising paradigm for alleviating this issue. However, unlearned models exhibit a peculiar "forgetting illusion" phenomenon whose cause has remained unclear. Based on empirical analysis, we formally explain this cause: most unlearning methods only partially disrupt the mapping between linguistic symbols and the underlying internal knowledge, leaving the knowledge intact as dormant memories. We further demonstrate that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, and that it also reflects unlearning strength. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a novel attack framework designed to assess the robustness of current unlearning methods. IVO optimizes the initial latent variables to realign the noise distribution of unlearned models with that of their vanilla counterparts, which reconstructs the fractured mappings and consequently revives the dormant memories. Extensive experiments covering 11 unlearning techniques and 3 concept scenarios show that IVO outperforms state-of-the-art baselines, exposing fundamental flaws in current unlearning mechanisms. Warning: This paper contains unsafe images that may offend some readers.
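The idea of optimizing an initial latent so that an unlearned model's noise prediction realigns with its vanilla counterpart can be illustrated with a deliberately simplified toy. The sketch below is not the authors' IVO implementation: the two "models" are hypothetical linear maps (`W_vanilla`, `W_unlearned`) standing in for noise predictors, the squared-error objective and the fixed target (the vanilla prediction at the starting latent) are assumptions made for illustration, and all names are invented here.

```python
import numpy as np


def noise_discrepancy(W_unlearned, z, target):
    """Squared distance between the toy unlearned prediction and the target."""
    residual = W_unlearned @ z - target
    return float(residual @ residual)


def optimize_initial_latent(W_unlearned, target, z_init, steps=500):
    """Gradient descent on the latent z only; the model weights stay frozen."""
    z = z_init.copy()
    # Step size 1/L, where L = 2 * sigma_max^2 is the Lipschitz constant of
    # the gradient, so the discrepancy never increases between steps.
    lr = 0.5 / np.linalg.norm(W_unlearned, 2) ** 2
    for _ in range(steps):
        residual = W_unlearned @ z - target
        z -= lr * 2.0 * (W_unlearned.T @ residual)
    return z


# Hypothetical stand-ins: a "vanilla" linear predictor and a perturbed
# "unlearned" copy of it (assumption: unlearning only shifts the weights).
rng = np.random.default_rng(0)
dim = 4
W_vanilla = rng.normal(size=(dim, dim))
W_unlearned = W_vanilla + 0.3 * rng.normal(size=(dim, dim))

z_init = rng.normal(size=dim)
target = W_vanilla @ z_init  # vanilla prediction to realign with (assumed target)

initial_loss = noise_discrepancy(W_unlearned, z_init, target)
z_opt = optimize_initial_latent(W_unlearned, target, z_init)
final_loss = noise_discrepancy(W_unlearned, z_opt, target)

print("discrepancy before:", initial_loss)
print("discrepancy after: ", final_loss)
```

Only the latent is updated here, which mirrors the abstract's framing that the attack acts on the initial latent variables rather than on model weights; a real diffusion model would replace the linear maps with text-conditioned denoisers and would require automatic differentiation.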