Robustifying Token Attention for Vision Transformers

Despite their success, vision transformers (ViTs) still suffer significant accuracy drops under common corruptions such as noise or blur. Interestingly, we observe that the attention mechanism of ViTs tends to rely on only a few important tokens, a phenomenon we call token overfocusing. More critically, these tokens are not robust to corruptions, which often leads to highly divergent attention patterns. In this paper, we aim to alleviate this overfocusing issue and make attention more stable through two general techniques. First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism. Specifically, TAP learns an average pooling scheme for each token so that information from potentially important tokens in the neighborhood can be taken into account adaptively. Second, our Attention Diversification Loss (ADL) forces the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few. We achieve this by penalizing high cosine similarity between the attention vectors of different tokens. In experiments, we apply our methods to a wide range of transformer architectures and improve robustness significantly. For example, building on the state-of-the-art robust architecture FAN, we improve corruption robustness on ImageNet-C by 2.4% while also improving accuracy by 0.4%. Moreover, when fine-tuning on semantic segmentation tasks, we improve robustness on CityScapes-C by 2.4% and on ACDC by 3.0%. Our code is available at https://github.com/guoyongcs/TAPADL.
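
To make the idea behind ADL concrete, the following is a minimal sketch of a penalty on the cosine similarity between the attention vectors of different tokens. It is not the released implementation: the function names `cosine_similarity` and `attention_diversity_penalty`, the averaging over all token pairs, and the toy attention matrices are assumptions made purely for illustration, and the actual loss may differ in how pairs are selected, thresholded, or weighted.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon (an assumption) guards against division by zero.
    return dot / (norm_u * norm_v + 1e-12)


def attention_diversity_penalty(attn):
    """Mean pairwise cosine similarity between attention rows.

    `attn` is a list of rows; row i holds the attention weights that
    output token i assigns to the input tokens. Averaging over all
    pairs is an illustrative choice, not taken from the paper.
    """
    n = len(attn)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(attn[i], attn[j])
    return total / (n * (n - 1) / 2)


# Toy example: every output token attends to the same input token.
overfocused = [[0.9, 0.05, 0.05]] * 3
# Toy example: each output token attends mostly to a different input token.
diverse = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

print(attention_diversity_penalty(overfocused))
print(attention_diversity_penalty(diverse))
```

In this toy setting the overfocused attention map, where all rows coincide, receives a larger penalty than the diverse one, which is the behavior a diversification term is meant to encourage during training.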
