Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models focus on predicting the specific contents of masked patches, and their performance depends heavily on pre-defined mask strategies. Intuitively, this procedure can be viewed as training a student (the model) to solve given problems (predicting masked patches). However, we argue that the model should not only focus on solving given problems, but also put itself in the shoes of a teacher and produce more challenging problems by itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss is a natural measure of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor that first predicts patch-wise losses and then decides where to mask next. The predictor adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that introducing the loss prediction objective alone leads to powerful representations, which verifies the value of being aware of which regions are hard to reconstruct.
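The abstract's two core ideas, learning the relative order of patch losses rather than their exact values and masking where the predicted loss is highest, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: the function names `pairwise_ranking_loss` and `hardest_patches`, the pairwise binary cross-entropy form of the objective, the plain top-k selection rule, and the use of Python lists in place of tensors.

```python
import math


def pairwise_ranking_loss(pred, target):
    """Binary cross-entropy over ordered pairs (i, j).

    Assumed objective: the predictor only has to say whether patch i is
    harder to reconstruct than patch j, not what the exact losses are.
    """
    total, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if i == j or target[i] == target[j]:
                continue  # self-pairs and ties carry no ordering information
            # Predicted probability that patch i has the larger loss.
            prob = 1.0 / (1.0 + math.exp(-(pred[i] - pred[j])))
            prob = min(max(prob, 1e-12), 1.0 - 1e-12)  # keep log() finite
            label = 1.0 if target[i] > target[j] else 0.0
            total -= label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob)
            pairs += 1
    return total / pairs if pairs else 0.0


def hardest_patches(pred, mask_ratio):
    """Indices of the patches with the highest predicted loss (assumed top-k rule)."""
    num_mask = int(len(pred) * mask_ratio)
    ranked = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)
    return sorted(ranked[:num_mask])
```

Because the objective depends only on differences between predictions, a predictor that ranks patches correctly on a completely different scale still scores well. This is one way to read "preventing overfitting to exact reconstruction loss values". A practical masking strategy would likely mix in some randomness rather than always taking the top-k patches; this sketch omits that.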