Masked image modeling has been demonstrated as a powerful pretext task for generating robust representations that generalize well across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining fixed during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module capable of generating masks of different complexities, and integrate it into masked autoencoders (MAE). Our module is jointly trained with the MAE, adjusting its behavior during training: it transitions from a partner of the MAE (optimizing the same reconstruction loss), through a neutral state, to an adversary (optimizing the opposite loss). The transition between these behaviors is smooth, regulated by a factor that multiplies the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. Empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can successfully self-supervise masked autoencoders. We release our code at https://github.com/ristea/cl-mae.
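To make the curriculum mechanism concrete, the following is a minimal PyTorch sketch of the loss-scaling idea described above: the masking module is trained on the MAE reconstruction loss multiplied by a factor that sweeps smoothly from +1 (partner) through 0 (neutral) to -1 (adversary), while the MAE itself always minimizes the reconstruction loss. The toy architectures, the linear factor schedule, and all names here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ToyMaskingModule(nn.Module):
    """Toy learnable masker: scores patch tokens, emits a soft mask (1 = hidden)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) -> soft mask: (batch, num_patches)
        return torch.sigmoid(self.score(tokens)).squeeze(-1)

class ToyMAE(nn.Module):
    """Stand-in for the MAE: reconstructs tokens from their masked versions."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        visible = tokens * (1.0 - mask).unsqueeze(-1)  # zero out masked patches
        return self.net(visible)

def curriculum_factor(step: int, total: int) -> float:
    """Assumed linear anneal of the masker's loss multiplier: +1 -> 0 -> -1."""
    return 1.0 - 2.0 * step / max(total - 1, 1)

dim, total_steps = 64, 1000
masker, mae = ToyMaskingModule(dim), ToyMAE(dim)
opt_masker = torch.optim.AdamW(masker.parameters(), lr=1e-4)
opt_mae = torch.optim.AdamW(mae.parameters(), lr=1e-4)

for step in range(total_steps):
    tokens = torch.randn(8, 196, dim)  # stand-in for ViT patch tokens
    mask = masker(tokens)

    # MAE update: always minimize reconstruction error; detach the mask so no
    # gradient flows back into the masking module here.
    recon = mae(tokens, mask.detach())
    recon_loss = ((recon - tokens) ** 2).mean()
    opt_mae.zero_grad()
    recon_loss.backward()
    opt_mae.step()

    # Masker update: same loss scaled by the curriculum factor, so the module
    # acts as partner (+1), neutral (0), then adversary (-1) over training.
    lam = curriculum_factor(step, total_steps)
    masker_loss = lam * ((mae(tokens, mask) - tokens) ** 2).mean()
    opt_masker.zero_grad()
    masker_loss.backward()
    opt_masker.step()
```

Under this schedule, the masker initially proposes masks that make reconstruction easy, then gradually shifts toward masks that maximize the reconstruction error, yielding the easy-to-hard curriculum.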