We introduce MIM-Refiner, a contrastive learning boost for pre-trained masked image modeling (MIM) models. MIM-Refiner is motivated by the insight that strong representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner attaches multiple contrastive heads to different intermediate layers. In each head, a modified nearest neighbor objective constructs semantic clusters, and the semantic information they capture improves performance on downstream tasks in both off-the-shelf and fine-tuning settings. The refinement process is short and simple, yet highly effective: within a few epochs, we refine the features of MIM models from subpar to state-of-the-art off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, sets a new state-of-the-art in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. On ImageNet-1K 1-shot classification, MIM-Refiner advances the state-of-the-art to 64.2%, outperforming larger models that were trained on up to 2000 times more data, such as DINOv2-g, OpenCLIP-G and MAWS-6.5B.
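To make the mechanism concrete, here is a minimal sketch of a nearest neighbor contrastive loss applied independently at several intermediate layers. It is an illustration under stated assumptions, not the MIM-Refiner objective itself: the abstract does not specify how the nearest neighbor objective is modified, so the code uses a generic nearest neighbor swap (each anchor is replaced by its most similar entry in a queue before an InfoNCE loss is computed). All function names, the temperature value, and the layer indices are hypothetical. Projection heads and gradient updates are omitted to keep the example in plain Python.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbor(query, queue):
    """Return the queue entry with the highest cosine similarity to query."""
    q = l2_normalize(query)
    return max(queue, key=lambda k: dot(q, l2_normalize(k)))


def nn_info_nce(anchors, positives, queue, temperature=0.2):
    """Generic nearest neighbor contrastive loss (an assumption, see lead-in).

    anchors[i] and positives[i] are embeddings of two views of sample i.
    Each anchor is swapped for its nearest neighbor in the queue, and that
    neighbor must then identify positives[i] among all positives in the batch.
    """
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        neighbor = l2_normalize(nearest_neighbor(anchor, queue))
        logits = [dot(neighbor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with target index i
    return total / len(anchors)


def multi_head_loss(view1_by_layer, view2_by_layer, queue_by_layer,
                    temperature=0.2):
    """Sum one contrastive loss per intermediate layer.

    Each argument maps a (hypothetical) layer index to that layer's data.
    Returns the summed loss and the per-layer losses.
    """
    per_layer = {
        layer: nn_info_nce(view1_by_layer[layer], view2_by_layer[layer],
                           queue_by_layer[layer], temperature)
        for layer in view1_by_layer
    }
    return sum(per_layer.values()), per_layer
```

In this sketch, every layer keeps its own queue and its own loss term, which mirrors the idea of attaching separate heads to different intermediate layers. In a real training setup the inputs would be the outputs of learned projection heads, and the summed loss would be backpropagated through the encoder.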