Self-supervised learning can mitigate the heavy reliance of Vision Transformer networks on very large, fully annotated datasets. Different classes of self-supervised methods yield representations with either good contextual reasoning properties (e.g., masked image modeling strategies) or invariance to image perturbations (e.g., contrastive methods). In this work, we propose MOCA, a single-stage, standalone method that unifies both desired properties using novel mask-and-predict objectives defined on high-level features rather than pixel-level details. Moreover, we show how to employ both learning paradigms in a synergistic and computation-efficient way. As a result, we achieve new state-of-the-art results in low-shot settings and strong results across various evaluation protocols, with training that is at least 3 times faster than prior methods.