In this paper, we present a comparative analysis of various self-supervised Vision Transformers (ViTs), focusing on their local representational power. Inspired by large language models, we examine the ability of ViTs to perform various computer vision tasks with little to no fine-tuning. We design an evaluation framework to analyze the quality of local, i.e.\ patch-level, representations in the context of few-shot semantic segmentation, instance identification, object retrieval, and tracking. We discover that contrastive-learning-based methods such as DINO produce more universal patch representations that can be immediately applied to downstream tasks with no parameter tuning, compared to masked image modeling. The embeddings learned with the latter approach, e.g. in masked autoencoders, have high-variance features that harm distance-based algorithms such as k-NN and do not contain useful information for most downstream tasks. Furthermore, we demonstrate that removing these high-variance features enhances k-NN performance for MAE, as well as for its recent extension Scale-MAE. Finally, we identify an object instance retrieval setting where DINOv2, a model pretrained on two orders of magnitude more data, falls short of its less compute-intensive counterpart DINO.
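The high-variance-feature removal described above can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code: it assumes patch embeddings are given as NumPy arrays, and the function names, the choice of Euclidean distance, and the number of dropped dimensions are all hypothetical.

```python
import numpy as np

def drop_high_variance_dims(train_feats, test_feats, n_drop=1):
    """Remove the n_drop highest-variance feature dimensions
    (variance estimated on the training split) from both splits."""
    var = train_feats.var(axis=0)
    keep = np.argsort(var)[:-n_drop]  # indices of all but the top-n_drop
    return train_feats[:, keep], test_feats[:, keep]

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Plain Euclidean k-NN classifier with majority voting."""
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of k nearest neighbours
    votes = train_labels[nn]                # (n_test, k) label matrix
    return np.array([np.bincount(v).argmax() for v in votes])
```

On synthetic data with one dominant noise dimension, dropping that dimension before running k-NN restores distance-based classification, mirroring the effect the paper reports for MAE and Scale-MAE embeddings.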