The widespread use of Text-to-Image (T2I) models in content generation requires careful examination of their safety, including their robustness to adversarial attacks. Despite extensive research on adversarial attacks, the reasons for their effectiveness remain underexplored. This paper presents an empirical study of adversarial attacks against T2I models, focusing on factors associated with attack success rate (ASR). We introduce a new attack objective, entity swapping using adversarial suffixes, along with two gradient-based attack algorithms. Human and automatic evaluations reveal that ASRs on entity swapping are asymmetric: for example, it is easier to replace "human" with "robot" in the prompt "a human dancing in the rain." by appending an adversarial suffix, whereas the reverse replacement is significantly harder. We further propose probing metrics that establish indicative signals linking the model's beliefs to the adversarial ASR. We identify conditions under which an adversarial attack succeeds with a probability of 60%, and others under which this probability drops below 5%.