Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their use has proliferated among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods and show that none of them fully excises the targeted concepts. Specifically, we exploit the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models without any alteration to their weights. Our results highlight the brittleness of post hoc concept erasure methods and call into question their use in the algorithmic toolkit for AI safety.
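The central idea of the abstract, recovering a target output by optimizing only an input embedding while every model weight stays frozen, can be illustrated with a deliberately tiny toy. The sketch below is not the procedure used in this work, which the abstract does not specify in detail. It assumes a hypothetical frozen linear map `FROZEN_W` as a stand-in for a sanitized generator and a hypothetical `TARGET` vector as a stand-in for the "erased" concept; `matvec` and `learn_embedding` are illustrative helper names.

```python
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def learn_embedding(W, target, steps=2000, lr=0.05, seed=0):
    """Gradient descent on the input embedding only; W is read, never modified."""
    rng = random.Random(seed)
    dim = len(W[0])
    v = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(steps):
        residual = [o - t for o, t in zip(matvec(W, v), target)]
        # Gradient of 0.5 * ||W v - target||^2 with respect to v is W^T residual.
        grad = [sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(dim)]
        v = [v_j - lr * g_j for v_j, g_j in zip(v, grad)]
    return v


# Hypothetical frozen "model" weights and target output (assumptions for this toy).
FROZEN_W = [[1.0, 0.5, 0.0],
            [0.0, 1.0, 0.5],
            [0.5, 0.0, 1.0]]
TARGET = [1.0, -2.0, 0.5]

weights_before = [row[:] for row in FROZEN_W]
embedding = learn_embedding(FROZEN_W, TARGET)
output = matvec(FROZEN_W, embedding)
loss = 0.5 * sum((o - t) ** 2 for o, t in zip(output, TARGET))

print("learned embedding:", embedding)
print("weights unchanged:", FROZEN_W == weights_before)
```

In this toy the optimizer finds an input that reproduces the target even though the weights are untouched. For a real text-to-image model, the analogous step would be learning a new token embedding against the frozen, sanitized network; whether and how well that succeeds is an empirical question that a linear toy cannot answer.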