Vision transformers (ViTs) have attracted broad interest in recent theoretical and empirical work. They achieve state-of-the-art results thanks to their attention-based approach, which, by avoiding inductive bias, improves the identification of key features and patterns within images and yields highly accurate image analysis. Meanwhile, recent studies have reported a ``sparse double descent'' phenomenon that can occur in modern deep-learning models, where extremely over-parametrized models can generalize well. This raises practical questions about the optimal model size and motivates the search for the best trade-off between sparsity and performance: are vision transformers also prone to sparse double descent? Can we find a way to avoid this phenomenon? Our work studies the occurrence of sparse double descent in ViTs. Although prior work has shown that traditional architectures, such as ResNet, inevitably exhibit sparse double descent, for ViTs we observe that an optimally tuned $\ell_2$ regularization mitigates the phenomenon. This comes at a cost, however: the optimal regularization strength $\lambda$ sacrifices the potential compression of the ViT.