The Vision Transformer has emerged as a powerful tool for image classification, surpassing the performance of convolutional neural networks (CNNs). Recently, many researchers have attempted to understand the robustness of Transformers against adversarial attacks. However, previous research has focused solely on perturbations in the spatial domain. This paper adds a complementary perspective by exploring the adversarial robustness of Transformers against frequency-selective perturbations in the spectral domain. To facilitate comparison between the two domains, we formulate an attack framework that serves as a flexible tool for attacking images in both the spatial and spectral domains. The experiments reveal that Transformers rely more heavily on phase and low-frequency information than CNNs do, which can render them more vulnerable to frequency-selective attacks. This work offers new insights into the properties and adversarial robustness of Transformers.
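To make the notion of a frequency-selective perturbation concrete, the following is a minimal sketch in Python with NumPy. It is not the attack framework of this paper: it applies random, bounded noise rather than an optimized adversarial perturbation, and the function names `radial_band_mask` and `perturb_spectrum`, the radial band parameters, and the budget `eps` are all hypothetical choices made for illustration. The sketch assumes a single-channel image with values in [0, 1] and perturbs either the phase or the magnitude of a chosen frequency band.

```python
import numpy as np


def radial_band_mask(shape, r_low, r_high):
    """Boolean mask over a centered (fftshift-ed) spectrum.

    Selects frequencies whose radius, in cycles per pixel, lies in
    [r_low, r_high). Hypothetical helper for illustration only.
    """
    h, w = shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return (radius >= r_low) & (radius < r_high)


def perturb_spectrum(image, mask, eps, component="phase", seed=0):
    """Apply bounded random noise to one spectral component inside `mask`.

    Assumptions: `image` is a 2D array in [0, 1]; the noise is random,
    standing in for the perturbation an actual attack would optimize.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Noise is nonzero only at the selected frequencies.
    noise = rng.uniform(-eps, eps, size=image.shape) * mask

    if component == "phase":
        phase = phase + noise  # additive shift in radians
    elif component == "magnitude":
        magnitude = np.clip(magnitude * (1.0 + noise), 0.0, None)  # relative change
    else:
        raise ValueError("component must be 'phase' or 'magnitude'")

    perturbed = magnitude * np.exp(1j * phase)
    # The noise is not Hermitian-symmetric, so the inverse transform is
    # complex in general; keeping the real part projects back to a real image.
    out = np.fft.ifft2(np.fft.ifftshift(perturbed)).real
    return np.clip(out, 0.0, 1.0)
```

Under these assumptions, a call such as `perturb_spectrum(image, radial_band_mask(image.shape, 0.0, 0.1), eps=0.5)` would perturb only the phase of the low-frequency band, while swapping the band limits or setting `component="magnitude"` selects a different part of the spectrum. Comparing a model's predictions across such choices is one simple way to probe which spectral components it depends on.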