Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveal that this "forgetting" is largely an illusion. Unlearning only partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memory. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of this mapping is retained, and thus of the strength of unlearning. Motivated by this, we propose IVO (Initial Latent Variable Optimization), a concise yet powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion, Adversarial Optimization, and Reused Attack, IVO optimizes the initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across 8 widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at anonymous.4open.science/r/IVO/. Warning: This paper contains unsafe images that may offend some readers.