The widespread use of Text-to-Image (T2I) models in content generation requires careful examination of their safety, including their robustness to adversarial attacks. Despite extensive research on such attacks, the reasons for their effectiveness remain underexplored. This paper presents an empirical study of adversarial attacks against T2I models, focusing on factors associated with attack success rates (ASRs). We introduce a new attack objective, entity swapping using adversarial suffixes, along with two gradient-based attack algorithms. Human and automatic evaluations reveal that ASRs on entity swaps are asymmetric: for example, it is easier to replace "human" with "robot" in the prompt "a human dancing in the rain." by appending an adversarial suffix, but significantly harder to do the reverse. We further propose probing metrics that establish indicative signals from the model's beliefs to the adversarial ASR. We identify conditions under which adversarial attacks succeed with 60% probability and others under which this probability drops below 5%.