Vision Transformers have proven highly effective on computer vision tasks owing to their ability to model long-range feature dependencies. By using large-scale training data and various self-supervised signals (e.g., masked random patches), vision transformers achieve state-of-the-art performance on several benchmark datasets, such as ImageNet-1k and CIFAR-10. However, vision transformers pretrained on general large-scale image corpora produce only an anisotropic representation space, which limits their generalizability and transferability to target downstream tasks. In this paper, we propose LaCViT, a simple and effective Label-aware Contrastive Training framework that improves the isotropy of the pretrained representation space of vision transformers, thereby enabling more effective transfer learning across a wide range of image classification tasks. Through experiments on five standard image classification datasets, we demonstrate that LaCViT-trained models outperform the original pretrained baselines by around 9% absolute Accuracy@1, and we observe consistent improvements when applying LaCViT to each of the three vision transformers we evaluate.
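To make the idea of label-aware contrastive training concrete, the sketch below shows one common way such an objective can be written: a supervised-contrastive-style loss in which samples that share a class label are treated as positives for one another. The abstract does not specify LaCViT's exact loss, so this form, the function names `l2_normalize` and `label_aware_contrastive_loss`, and the `temperature` value are all assumptions made for illustration, not the authors' implementation. It uses plain Python lists in place of real transformer embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def label_aware_contrastive_loss(embeddings, labels, temperature=0.1):
    """Hypothetical supervised-contrastive-style loss (an assumed form).

    For each anchor, every other sample with the same label is a positive;
    all remaining samples act as negatives in the softmax denominator.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of picking each positive.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings forming two tight clusters (illustrative values only).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = label_aware_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
misaligned_loss = label_aware_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy setup the loss is smaller when the labels agree with the cluster structure than when they cut across it, which is the signal such an objective would use to pull same-class representations together during fine-tuning.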