Vision transformers, which are attention-based networks, achieve state-of-the-art results on a variety of computer vision tasks. However, most research on transformers does not investigate the robustness/accuracy trade-off, and these models still struggle to handle adversarial perturbations. In this paper, we explore the robustness of vision transformers against adversarial perturbations and try to improve their robustness/accuracy trade-off in white-box attack settings. To this end, we propose the Locality iN Locality (LNL) transformer model. We prove that introducing locality into LNL contributes to robustness, since it aggregates local information such as lines, edges, shapes, and even objects. In addition, to further improve robustness, we encourage LNL to extract training signal from both the moments (i.e., the mean and standard deviation) and the normalized features. We validate the effectiveness and generality of LNL by achieving state-of-the-art accuracy and robustness on the German Traffic Sign Recognition Benchmark (GTSRB) and the Canadian Institute for Advanced Research dataset (CIFAR-10). More specifically, for traffic sign classification, the proposed LNL yields gains of 1.1% and ~35% in clean and robust accuracy, respectively, compared to state-of-the-art studies.
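To make the phrase "moments and normalized features" concrete, the following is a minimal sketch of how a feature vector can be split into its mean, its standard deviation, and a normalized remainder. It is an illustration under stated assumptions, not the LNL implementation: the abstract does not say over which axis the moments are computed, so the per-token, across-channel choice here, along with the names `moments_and_normalized`, `reconstruct`, and `eps`, are hypothetical.

```python
import math


def moments_and_normalized(features, eps=1e-5):
    """Split one feature vector into (mean, std, normalized features).

    Assumption: `features` holds the channel values of a single token.
    `eps` is a small constant added to the variance for numerical stability.
    """
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    std = math.sqrt(var + eps)
    normalized = [(x - mean) / std for x in features]
    return mean, std, normalized


def reconstruct(mean, std, normalized):
    """Invert the split: the moments and normalized part together keep all information."""
    return [z * std + mean for z in normalized]


# Hypothetical token with four channels.
token = [0.5, 1.5, -1.0, 3.0]
mean, std, normalized = moments_and_normalized(token)
restored = reconstruct(mean, std, normalized)
```

The sketch only shows that the two parts are complementary: the normalized features carry the shape of the activation pattern, while the moments carry its offset and scale, so a model can in principle be supervised on either part.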