Improving Robustness for Vision Transformer with a Simple Dynamic Scanning Augmentation

Vision Transformer (ViT) has demonstrated promising performance in computer vision tasks, comparable to state-of-the-art neural networks. Yet this new type of deep neural network architecture is vulnerable to adversarial attacks, which limits its robustness. This article aims to improve both the accuracy and the robustness of ViT, particularly against adversarial attacks. We propose an augmentation technique called "Dynamic Scanning Augmentation" that leverages dynamic input sequences to adaptively focus on different patches, thereby maintaining performance and robustness. Our detailed investigations reveal that this adaptability to the input sequence induces significant changes in the attention mechanism of ViT, even for the same image. We introduce four variants of Dynamic Scanning Augmentation that outperform ViT in both robustness to adversarial attacks and accuracy on natural images, with one variant showing comparable results. By integrating our augmentation technique, we observe a substantial increase in ViT's robustness, from $17\%$ to $92\%$, measured across different types of adversarial attacks. These findings, together with other comprehensive tests, indicate that Dynamic Scanning Augmentation enhances accuracy and robustness by promoting a more adaptive type of attention. The results highlight the potential of this approach for advancing computer vision tasks and merit further exploration in future studies.
