Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is refining the surrogate model. However, existing work on ViTs has restricted surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0\% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Code and appendix are available at https://github.com/RYC-98/FPR.
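To make the AMD idea concrete, below is a minimal PyTorch sketch of one plausible diversification step, assuming AMD perturbs selected post-softmax attention maps with random multiplicative noise and renormalizes them. The function name `attention_with_amd` and the noise form are illustrative assumptions, not the paper's exact operator; see the repository above for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def attention_with_amd(q, k, v, diversify=True, noise_scale=0.1):
    """Self-attention with a hypothetical Attention Map Diversification step.

    q, k, v: (batch, heads, tokens, head_dim) query/key/value tensors.
    The noise form is an illustrative assumption, not the paper's operator.
    """
    attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    if diversify:
        # Perturb the post-softmax map with random multiplicative noise,
        # clamp to keep weights non-negative, then renormalize each row.
        noise = (1.0 + noise_scale * torch.randn_like(attn)).clamp(min=0.0)
        attn = attn * noise
        attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v
```

Applying such a perturbation only in selected layers matches the "certain attention maps" wording above, and because the perturbed maps lie on the backward path, they also reshape the gradients that flow to the input.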
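For MTE, here is a similarly hedged sketch of how historical token embeddings might be accumulated across attack iterations, assuming an exponential-moving-average update. The class name, the `momentum` value, and the blending rule are assumptions for illustration, since the abstract only states that MTE accumulates historical token embeddings to stabilize forward updates.

```python
import torch

class MomentumTokenEmbedding:
    """Hypothetical MTE state attached to one Attention or MLP block.

    Keeps a running average of the token embeddings seen at this block
    across attack iterations and mixes it into the current forward pass.
    """
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.history = None  # accumulated token embeddings

    def __call__(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim) entering the block.
        if self.history is None:
            self.history = tokens.detach()
        else:
            # Exponential moving average over attack iterations (detached,
            # so the history itself carries no gradient).
            self.history = (self.momentum * self.history
                            + (1.0 - self.momentum) * tokens.detach())
        # Blend accumulated history into the live embeddings; gradients
        # still flow to the input through the current-token term.
        return self.momentum * self.history + (1.0 - self.momentum) * tokens
```

Under this reading, one such object per refined block, reset between images, would smooth the embeddings that the Attention and MLP blocks see from one attack iteration to the next.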