Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Because of non-convex interactions across layers, however, a theoretical analysis of their learning and generalization has remained largely elusive. Based on a data model that characterizes both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity required to achieve zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that training with stochastic gradient descent (SGD) leads to a sparse attention map, which formally verifies the general intuition about why attention succeeds. Moreover, this paper indicates that proper token sparsification can improve test performance by removing label-irrelevant and/or noisy tokens, including those carrying spurious correlations. Experiments on synthetic data and the CIFAR-10 dataset support our theoretical results, which also generalize to deeper ViTs.
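To make the architecture concrete, the following is a minimal sketch of a shallow ViT forward pass: one self-attention layer followed by a two-layer perceptron with a ReLU hidden layer, averaged over query tokens. The function names, the weight names (`W_Q`, `W_K`, `W_V`, `W_O`, `a`), the averaging step, and the toy data are all assumptions made for illustration; they are not taken from the paper, and the weights are hand-picked rather than learned by SGD.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens, W_Q, W_K, query_index):
    """Attention of one query token over all tokens (dot-product scores)."""
    q = matvec(W_Q, tokens[query_index])
    keys = [matvec(W_K, x) for x in tokens]
    scores = [sum(ki * qi for ki, qi in zip(k, q)) for k in keys]
    return softmax(scores)


def shallow_vit(tokens, W_Q, W_K, W_V, W_O, a):
    """One self-attention layer, then ReLU(W_O . context) weighted by a,
    averaged over all query tokens (an illustrative choice)."""
    values = [matvec(W_V, x) for x in tokens]
    dim = len(values[0])
    total = 0.0
    for l in range(len(tokens)):
        attn = attention_weights(tokens, W_Q, W_K, l)
        context = [
            sum(attn[j] * values[j][i] for j in range(len(tokens)))
            for i in range(dim)
        ]
        hidden = [max(0.0, h) for h in matvec(W_O, context)]
        total += sum(ai * hi for ai, hi in zip(a, hidden))
    return total / len(tokens)


# Hypothetical toy data: one label-relevant token and two label-irrelevant ones.
relevant, irrelevant = [1.0, 0.0], [0.0, 1.0]
tokens = [relevant, irrelevant, irrelevant]

# Hand-picked weights standing in for a trained model.
W_Q = W_K = [[2.0, 0.0], [0.0, 2.0]]
W_V = [[1.0, 0.0], [0.0, 1.0]]
W_O = [[1.0, 0.0]]  # the single hidden unit responds to the relevant direction
a = [1.0]

print("attention of token 0:", attention_weights(tokens, W_Q, W_K, 0))
print("output, all tokens:  ", shallow_vit(tokens, W_Q, W_K, W_V, W_O, a))
print("output, pruned:      ", shallow_vit([relevant], W_Q, W_K, W_V, W_O, a))
```

In this toy setting the attention of the relevant token concentrates on itself, and dropping the two irrelevant tokens raises the model output. This only illustrates the intuition behind sparse attention and token sparsification; it does not reproduce the paper's data model or its guarantees.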