With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on ``implicitly adversarial'' prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge is run in consecutive rounds to enable sustained discovery and analysis of safety pitfalls in T2I models. In this paper, we present an in-depth account of our methodology, a systematic study of novel attack strategies, and a discussion of safety failures revealed by challenge participants. We also release a companion visualization tool for easy exploration and derivation of insights from the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as ``safe'' by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. We are confident that this work will enable proactive, iterative safety assessments and promote responsible development of T2I models.