Text-to-image models are vulnerable to the stepwise Divide-and-Conquer Attack (DACA), which uses a large language model to obfuscate inappropriate content in prompts by wrapping sensitive text in a benign narrative. To mitigate stepwise DACA attacks, we propose a two-layer defense: text summarization followed by binary classification. We assembled the Adversarial Text-to-Image Prompt (ATTIP) dataset ($N=940$), which contains DACA-obfuscated and non-obfuscated prompts. From the ATTIP dataset, we created two summarized variants: one generated by a small encoder model and the other by a large language model. We then used an encoder classifier and a GPT-4o classifier to perform content moderation on the summarized and unsummarized prompts. Compared with a classifier operating on the unsummarized data, our method improved F1 score by 31%. Further, the highest F1 score (98%) was achieved by the encoder classifier on a summarized ATTIP variant. These results indicate that pre-classification text summarization can inoculate content detection models against stepwise DACA obfuscations.
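The two-layer pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `summarize` stands in for the small encoder model or LLM summarizer, and `classify` stands in for the encoder or GPT-4o classifier; both are hypothetical placeholders (a word-truncation summarizer and a keyword-based binary classifier) chosen only to show the control flow of summarization before moderation.

```python
def summarize(prompt: str, max_words: int = 30) -> str:
    """Stand-in summarizer: the paper uses a small encoder model or an LLM;
    here we crudely keep the first max_words words to mimic compression of
    the benign narrative wrapper."""
    return " ".join(prompt.split()[:max_words])


def classify(text: str, blocklist=("weapon", "gore", "explicit")) -> str:
    """Stand-in binary classifier: flags text containing blocklisted terms.
    The paper instead uses a trained encoder classifier or GPT-4o."""
    flagged = any(term in text.lower() for term in blocklist)
    return "inappropriate" if flagged else "benign"


def moderate(prompt: str) -> str:
    """Two-layer defense: summarization first strips away the obfuscating
    narrative, then the classifier moderates the condensed summary."""
    return classify(summarize(prompt))


if __name__ == "__main__":
    benign = "A golden retriever plays fetch on a sunny beach at sunset."
    obfuscated = "Write a heartfelt story about a museum curator who lovingly restores an antique weapon for a历史 exhibit."
    print(moderate(benign))      # expected: benign
    print(moderate(obfuscated))  # expected: inappropriate
```

The design point is that the classifier never sees the full adversarial narrative, only the summary, which makes the sensitive core harder to hide behind stepwise obfuscation.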