Vision Transformers (ViTs) are increasingly used across computer vision tasks thanks to their powerful representation capabilities, yet how they process information layer by layer remains understudied. Numerous studies have shown that convolutional neural networks (CNNs) extract features of increasing complexity throughout their layers, a property that is crucial for tasks like domain adaptation and transfer learning. ViTs lack the inductive biases of CNNs, and their attention mechanisms can in principle capture global dependencies from the earliest layers. Given the growing importance of ViTs in computer vision, a better layer-wise understanding of them is needed. In this work, we present a novel layer-wise analysis of the concepts encoded in state-of-the-art ViTs using neuron labeling. Our findings reveal that ViTs encode concepts of increasing complexity throughout the network: early layers primarily encode basic features such as colors and textures, while later layers represent more specific classes, including objects and animals. As the complexity of the encoded concepts grows, the number of concepts represented in each layer also rises, reflecting a more diverse and specific set of features. Additionally, the pretraining strategy influences both the quantity and the category of encoded concepts, and finetuning on a specific downstream task generally reduces the number of encoded concepts while shifting them toward more task-relevant categories.
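To make the layer-wise setup concrete, the snippet below is a minimal sketch, not the exact pipeline of this work: it hooks every transformer block of a pretrained timm ViT (`vit_base_patch16_224` is an assumed model choice), records per-layer CLS-token activations over a batch of probing images (random tensors stand in for a real probing dataset), and leaves the neuron-labeling step as a hypothetical placeholder that would score each neuron against a concept bank (e.g., colors, textures, objects) in the style of neuron-labeling methods.

```python
# Minimal sketch of layer-wise neuron probing in a ViT (assumptions noted inline).
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

# Collect CLS-token activations after each transformer block via forward hooks.
activations = {}

def make_hook(idx):
    def hook(module, inputs, output):
        # output: (batch, tokens, dim); keep the CLS token for this block
        activations[idx] = output[:, 0, :].detach()
    return hook

handles = [blk.register_forward_hook(make_hook(i)) for i, blk in enumerate(model.blocks)]

images = torch.randn(8, 3, 224, 224)  # stand-in for a real probing dataset
with torch.no_grad():
    model(images)
for h in handles:
    h.remove()

# Hypothetical labeling step: a real pipeline would match each neuron's
# activation pattern to a concept bank; here we only list the most strongly
# activated neurons per block as a placeholder.
for layer, acts in activations.items():
    top_neurons = acts.abs().mean(0).topk(5).indices
    print(f"block {layer}: strongest neurons {top_neurons.tolist()}")
```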