Curved Representation Space of Vision Transformers

Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs). However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by empirically investigating how the output of the penultimate layer moves in the representation space as the input moves linearly within a small region. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers show a nonlinear relationship for some data. For those data, the output of Transformers moves along a curved trajectory as the input moves linearly. (2) When a data point lies in a curved region, it is hard to move it out of its decision region, because the output follows a curved trajectory instead of a straight line toward the decision boundary. This results in the high robustness of Transformers. (3) If a data point is slightly modified so that it jumps out of the curved region, the subsequent movements become linear and the output goes directly to the decision boundary. In other words, a decision boundary does exist near the data point, and it is hard to find only because of the curved representation space. This explains the underconfident predictions of Transformers. We also examine the mathematical properties of the attention operation that induce a nonlinear response to linear perturbations. Finally, we share additional findings on what contributes to the curved representation space of Transformers and how the curvedness evolves during training.
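As a rough illustration of the measurement described above, the following sketch moves an input along a fixed direction in equal steps and records the angle between successive output displacements. It is a toy under stated assumptions, not the procedure or the models used in the paper: the random linear map, the single-head softmax attention layer, the helper name `direction_change_deg`, and the angle-based curvature proxy are all introduced here for illustration only.

```python
import numpy as np

def direction_change_deg(f, x, d, steps=10, step_size=0.1):
    """Angles (degrees) between successive output displacements of f
    as the input moves from x along direction d in equal steps.
    Hypothetical helper: a simple proxy for trajectory curvature."""
    outs = np.stack([f(x + k * step_size * d).ravel() for k in range(steps + 1)])
    deltas = np.diff(outs, axis=0)  # output movement per input step
    angles = []
    for a, b in zip(deltas[:-1], deltas[1:]):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return np.array(angles)

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8

# Toy stand-ins (assumptions): random weights, no training, no real ViT or CNN.
W = rng.normal(size=(dim, dim))
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

def linear_map(x):
    # Purely linear layer: the output moves in a straight line.
    return x @ W

def self_attention(x):
    # Single-head scaled dot-product self-attention over the token rows of x.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(dim)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

x = rng.normal(size=(n_tokens, dim))
d = rng.normal(size=(n_tokens, dim))
d = d / np.linalg.norm(d)  # unit-norm travel direction

linear_angles = direction_change_deg(linear_map, x, d)
attention_angles = direction_change_deg(self_attention, x, d)

print("max direction change, linear map (deg):", linear_angles.max())
print("max direction change, attention  (deg):", attention_angles.max())
```

For the linear map the angles are zero up to floating-point error, since every step produces the same output displacement. For the attention layer they are nonzero, because the softmax weights themselves depend on the input. Applying the same helper to the penultimate-layer features of a trained network would be one possible way to probe the behavior discussed in the abstract, though the paper's own measurement may differ.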
