We introduce MIM-Refiner, a contrastive learning boost for pre-trained masked image modeling (MIM) models. MIM-Refiner is motivated by the insight that the strongest representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner attaches multiple contrastive heads to different intermediate layers. In each head, a modified nearest neighbor objective helps to form semantic clusters. The refinement process is short but effective: within a few epochs, we refine the features of MIM models from subpar to state-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, sets a new state-of-the-art in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. In ImageNet-1K 1-shot classification, MIM-Refiner sets a new state-of-the-art of 64.2%, outperforming larger models that were trained on up to 2000x more data, such as DINOv2-g, OpenCLIP-G and MAWS-6.5B. Project page: https://ml-jku.github.io/MIM-Refiner
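To make the idea of several contrastive heads with a nearest neighbor objective concrete, the following is a minimal sketch and not the method's actual implementation. The abstract does not specify the modified objective, so the sketch substitutes a generic NNCLR-style nearest neighbor swap: each anchor embedding is replaced by its most similar entry in a queue before an InfoNCE loss is computed against the other view. All names (`nn_contrastive_loss`, `multi_head_loss`, the layer indices 6, 9, 12) and the temperature of 0.2 are assumptions chosen for illustration, and the head outputs are given directly as toy vectors instead of being produced by a ViT.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nn_contrastive_loss(view_a, view_b, queue, temperature=0.2):
    """InfoNCE with a nearest neighbor swap (NNCLR-style stand-in, an assumption).

    view_a[i] and view_b[i] are embeddings of two augmentations of sample i.
    Each anchor from view_a is replaced by its nearest neighbor in `queue`,
    and that neighbor must identify view_b[i] among all entries of view_b.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    q = [l2_normalize(v) for v in queue]
    total = 0.0
    for i, anchor in enumerate(a):
        # Nearest neighbor swap: pick the most similar queue entry.
        neighbor = max(q, key=lambda k: dot(anchor, k))
        logits = [dot(neighbor, other) / temperature for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[i]  # cross-entropy with target index i
    return total / len(a)


def multi_head_loss(head_outputs, queues, temperature=0.2):
    """Average the per-head losses; one head (and one queue) per chosen layer."""
    losses = [
        nn_contrastive_loss(view_a, view_b, queues[layer], temperature)
        for layer, (view_a, view_b) in head_outputs.items()
    ]
    return sum(losses) / len(losses)


# Toy usage: three hypothetical heads on layers 6, 9 and 12 sharing toy data.
queue = [[1.0, 0.1], [0.1, 1.0]]
views = ([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
head_outputs = {6: views, 9: views, 12: views}
queues = {6: queue, 9: queue, 12: queue}
loss = multi_head_loss(head_outputs, queues)
```

In a real setup each head would be a small projection network fed by the corresponding intermediate layer of the pre-trained encoder, and each head would maintain its own queue of past embeddings. Averaging the head losses is one simple way to combine them and is likewise an assumption of this sketch.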