The proliferation of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed concept erasure methods that aim to sever unwanted concepts from the model via fine-tuning, yet it remains unclear whether these approaches truly remove all links to the harmful concept or merely conceal superficial connections. In this work, we reveal a critical vulnerability, the Erasure Evasion Backdoor (EEB): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent erasure. We show that both black-box and white-box adversaries can instantiate this threat. Across six state-of-the-art erasure methods, including robust ones that explicitly search for alternative representations of the target concept, EEB consistently exposes harmful content: up to 82% success against celebrity-identity unlearning, up to 94% success against object erasure, and up to a 16-fold amplification of explicit-content exposure. While EEB uncovers a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future concept erasure techniques.