Vision transformers are highly effective on computer vision tasks because they can model long-range feature dependencies. Trained on large-scale data with various self-supervised signals (e.g., masked random patches), they achieve state-of-the-art performance on several benchmark datasets, such as ImageNet-1k and CIFAR-10. However, vision transformers pretrained on general large-scale image corpora can only produce an anisotropic representation space, which limits their generalizability and transferability to target downstream tasks. In this paper, we propose LaCViT, a simple and effective Label-aware Contrastive Training framework that improves the isotropy of the pretrained representation space of vision transformers, thereby enabling more effective transfer learning across a wide range of image classification tasks. In experiments on five standard image classification datasets, LaCViT-trained models outperform the original pretrained baselines by around 9% absolute Accuracy@1, and the improvements are consistent across the three vision transformers we evaluate.
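The abstract does not spell out the training objective, so the following is only a minimal sketch of what a label-aware contrastive loss can look like, assuming a supervised-contrastive-style formulation in which embeddings that share a class label are pulled together and all others are pushed apart. The function names `label_aware_contrastive_loss` and `mean_pairwise_cosine`, the `temperature` value, and the use of NumPy are illustrative assumptions, not details taken from LaCViT. The second function is a common rough proxy for anisotropy (the average cosine similarity between distinct embeddings), included only to make the notion of an isotropic representation space concrete.

```python
import numpy as np


def label_aware_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised-contrastive-style loss (assumed form, not the paper's exact loss).

    embeddings: (n, d) array of feature vectors, e.g. from a vision transformer.
    labels:     (n,) array of integer class labels.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    n = embeddings.shape[0]

    # L2-normalise so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Exclude each sample's similarity with itself from the softmax.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Positives: other samples in the batch with the same label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive in this batch

    # Numerically stable log-softmax over the non-self entries of each row.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each anchor that has any.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over distinct pairs (needs at least 2 rows).

    Values near 1 suggest embeddings crowd into a narrow cone (anisotropic);
    values near 0 suggest directions are spread more evenly.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    n = sim.shape[0]
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


if __name__ == "__main__":
    # Toy batch: two tight clusters in 2-D.
    feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    print(label_aware_contrastive_loss(feats, np.array([0, 0, 1, 1])))
    print(mean_pairwise_cosine(feats))
```

In this toy batch, the loss is small when the labels agree with the two clusters and large when they cut across them, which is the behaviour a label-aware objective is meant to reward during fine-tuning.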