We investigate the robustness of vision transformers (ViTs) through the lens of their patch-based architecture, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable to humans. This indicates that ViTs rely heavily on features that survive such transformations but are generally not indicative of the semantic class to humans. Further investigation shows that these features are useful but non-robust: ViTs trained on them achieve high in-distribution accuracy but break down under distribution shift. Based on this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use images transformed with our patch-based operations as negatively augmented views and propose losses that regularize training away from non-robust features. This view complements existing research, which mostly focuses on augmenting inputs with semantic-preserving transformations to enforce model invariance. We show that patch-based negative augmentation consistently improves the robustness of ViTs across a wide set of ImageNet-based robustness benchmarks. Furthermore, we find that our patch-based negative augmentation is complementary to traditional (positive) data augmentation, and that combining the two boosts performance further.
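To make the idea of a patch-based negative view concrete, the following is a minimal NumPy sketch under stated assumptions. The abstract does not name the specific transformations or losses, so both functions here are illustrative: `patch_shuffle` (a hypothetical name) randomly permutes the non-overlapping patches of an image, which is one example of a transformation that can destroy global semantics while preserving patch-local statistics, and `uniform_negative_loss` (also hypothetical) is one plausible regularizer that pushes predictions on negative views toward the uniform distribution. Neither should be read as the exact method of the paper.

```python
import numpy as np


def patch_shuffle(image, patch_size, rng):
    """Randomly permute the non-overlapping patches of an H x W x C image.

    Illustrative negative augmentation (an assumption, not taken from the text).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size

    # Split the image into a stack of patches with shape (gh * gw, p, p, c).
    patches = (
        image.reshape(gh, patch_size, gw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch_size, patch_size, c)
    )

    # Reorder the patches with a random permutation of their grid positions.
    patches = patches[rng.permutation(gh * gw)]

    # Reassemble the permuted patches into an image of the original shape.
    return (
        patches.reshape(gh, gw, patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h, w, c)
    )


def uniform_negative_loss(logits):
    """Cross-entropy between a uniform target and softmax(logits).

    Assumed form of a negative-view regularizer: it is minimized (at log K
    for K classes) when the model is maximally uncertain on the negative view.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax over the class axis.
    m = logits.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    log_probs = logits - lse
    # Uniform target: average the negative log-probability over all classes.
    return float(-log_probs.mean())


# Usage: build a negative view of a toy 8 x 8 RGB image with 4 x 4 patches.
rng = np.random.default_rng(0)
toy_image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
negative_view = patch_shuffle(toy_image, patch_size=4, rng=rng)
```

In a training loop, a term like `uniform_negative_loss` evaluated on the model's outputs for `negative_view` would be added, with some weight, to the usual classification loss on the clean image. The weighting and the choice of loss are design decisions that this sketch leaves open.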