In recent years, Text-to-Image (T2I) models have advanced remarkably and gained widespread adoption. However, this progress has also opened avenues for misuse, particularly the generation of inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that poses a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards such as prompt filters and post-hoc safety checkers, thereby exposing the vulnerabilities of existing defense mechanisms.