The latest generation of transformer-based vision models has proven superior to Convolutional Neural Network (CNN)-based models across several vision tasks, an advantage largely attributed to their strength in relation modeling. Deformable vision transformers significantly reduce the quadratic complexity of attention by using sparse attention structures, which enables their use in larger-scale applications such as multi-view vision systems. Recent work has demonstrated adversarial attacks against transformers; we show that these attacks do not transfer to deformable transformers because of their sparse attention structure. Specifically, attention in deformable transformers is modeled using pointers to the most relevant other tokens. In this work, we present the first adversarial attacks that manipulate the attention of deformable transformers, distracting them so that they focus on irrelevant parts of the image. We also develop new collaborative attacks in which a source patch manipulates attention to point to a target patch that adversarially attacks the system. In our experiments, we find that patching only 1% of the input field can reduce average precision (AP) to 0%. We also show that the attacks are versatile enough to support different attacker scenarios, because they redirect attention under the attacker's control.
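To make the "pointer" view of attention concrete, the following is a minimal single-head, single-scale sketch of deformable attention in NumPy. It is an illustration under simplifying assumptions, not the implementation used in this work: the function names, the projection matrices `W_off` and `W_attn`, and the tensor shapes are all hypothetical. Each query predicts K sampling offsets around a reference point and K attention weights, so it reads only K locations of the feature map rather than every token.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (H, W, C) at a fractional location (y, x)."""
    H, W, _ = feat.shape
    # Clamp to the feature map so out-of-range pointers stay valid.
    y = float(np.clip(y, 0, H - 1))
    x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

def deformable_attention(query, ref_point, feat, W_off, W_attn):
    """Sparse attention for one query (hypothetical, simplified form).

    query:     (C,) query feature
    ref_point: (y, x) reference location of the query
    feat:      (H, W, C) value feature map
    W_off:     (C, 2K) assumed linear map from query to K (dy, dx) offsets
    W_attn:    (C, K) assumed linear map from query to K attention logits
    """
    K = W_attn.shape[1]
    offsets = (query @ W_off).reshape(K, 2)        # the "pointers"
    logits = query @ W_attn
    weights = np.exp(logits - logits.max())        # numerically stable softmax
    weights /= weights.sum()
    locations = np.asarray(ref_point, dtype=float) + offsets
    samples = np.stack([bilinear_sample(feat, y, x) for y, x in locations])
    return weights @ samples, locations, weights   # output has shape (C,)

# Example with random inputs: C = 8 channels, K = 4 sampling points.
rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 16, 8))
query = rng.normal(size=8)
W_off = rng.normal(size=(8, 2 * 4))
W_attn = rng.normal(size=(8, 4))
out, locations, weights = deformable_attention(query, (8, 8), feat, W_off, W_attn)
```

In this sketch both the sampling locations and the weights are functions of the query, which is itself computed from the input image. That dependence is what an attention-manipulating patch would exploit: changing the predicted offsets changes which tokens the query reads.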