When Safe Concepts Become Unsafe: Multi-Concept Compositional Vulnerabilities in Text-to-Image Models

Text-to-image (T2I) models are increasingly optimized to follow user instructions faithfully. We find that this capability introduces a safety vulnerability we call Multi-Concept Compositional Unsafety (MCCU). MCCU occurs when multiple concepts that are each safe on their own produce harmful or sensitive visual outputs once they are combined in a single generation request. Unlike prior jailbreak settings, MCCU does not rely on adversarial prompts, model access, or explicitly disallowed content. Instead, the risk emerges from how the model composes multiple safe visual concepts into a single scene. To measure this threat systematically, we build TwoHamsters, a large-scale evaluation framework consisting of 20k prompts, 51 curated concept pairs, and six risk categories. We evaluate 13 T2I models in a black-box setting. Our results show a clear conflict between instruction-following and safety: models that follow prompts more faithfully tend to produce more MCCU failures. For example, FLUX.1 achieves a 99.35% Unsafe Alignment Rate while reaching only a 1.57% MCCU Defense Rate. We further evaluate three representative defenses (safety filtering, MCCU-specific detector fine-tuning, and concept erasure), all of which fail against unseen concept combinations. Our findings suggest that compositional reasoning in T2I models creates an attack surface that existing safety mechanisms do not capture. We anticipate that the release of TwoHamsters will catalyze community development of more advanced defenses for generative models.
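The MCCU definition can be illustrated with a minimal Python sketch. It is not the TwoHamsters pipeline: the function name `find_mccu_pairs`, the placeholder concept names, and the toy judge `toy_is_unsafe` are all hypothetical. In a real black-box evaluation the judge would presumably generate an image from a prompt built on the given concepts and classify the result; here it is replaced by a lookup so the logic can run on its own.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical judge type: takes a set of concepts, returns True if the
# resulting generation would be judged unsafe.
Judge = Callable[[FrozenSet[str]], bool]


def find_mccu_pairs(concepts: Iterable[str], is_unsafe: Judge) -> List[Tuple[str, str]]:
    """Return pairs whose members are each safe alone but unsafe together."""
    # Keep only concepts that pass the judge individually.
    safe_alone = [c for c in concepts if not is_unsafe(frozenset([c]))]
    # A pair is an MCCU case when the combination is judged unsafe.
    return [
        (a, b)
        for a, b in combinations(safe_alone, 2)
        if is_unsafe(frozenset([a, b]))
    ]


# Toy stand-in for a real judge (assumption): a fixed set of flagged combinations.
FLAGGED = {
    frozenset({"concept_a", "concept_b"}),  # unsafe only in combination
    frozenset({"concept_d"}),               # unsafe on its own, so not MCCU
}


def toy_is_unsafe(concept_set: FrozenSet[str]) -> bool:
    # Unsafe if the request contains every concept of some flagged combination.
    return any(flagged <= concept_set for flagged in FLAGGED)


pairs = find_mccu_pairs(
    ["concept_a", "concept_b", "concept_c", "concept_d"], toy_is_unsafe
)
print(pairs)
```

The sketch filters out concepts that are already unsafe alone before pairing, so whatever remains reflects only the compositional case described above.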
