Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing their vulnerabilities has become a priority. Here we propose an automatic red teaming framework that evaluates a given model and exposes its vulnerability to generating unsafe and inappropriate content. Our framework uses in-context learning in a feedback loop to red team models and trigger them into generating unsafe content. We propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. Our experiments demonstrate that, compared to baseline approaches, our proposed strategy is significantly more effective at exposing vulnerabilities in the Stable Diffusion (SD) model, even when the latter is enhanced with safety features. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models, resulting in a significantly higher toxic response generation rate than previously reported numbers.