Masked image modeling has proven to be a powerful pretext task for learning robust representations that generalize well across multiple downstream tasks. Typically, this approach randomly masks patches (tokens) in the input images, and the masking strategy remains unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module that can generate masks of different complexities, and we integrate the proposed module into masked autoencoders (MAE). Our module is trained jointly with the MAE and changes its behavior during training, transitioning from a partner of the MAE (optimizing the same reconstruction loss), through a neutral state, to an adversary (optimizing the opposite loss). The transition between these behaviors is smooth, regulated by a factor that multiplies the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. The empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can be successfully used to self-supervise masked autoencoders. We release our code at https://github.com/ristea/cl-mae.
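The partner-to-adversary transition described in the abstract can be illustrated with a minimal sketch. The abstract states only that a factor multiplies the reconstruction loss of the masking module and that the transition is smooth; the linear schedule, the endpoints of +1 and -1, and the function names `curriculum_factor`, `masking_module_loss`, and `behavior` are assumptions made here for illustration and are not taken from the released code.

```python
def curriculum_factor(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: +1 at the first step, 0 midway, -1 at the last step."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    # Clamp the step into the valid range, then map it to progress in [0, 1].
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    return 1.0 - 2.0 * progress


def masking_module_loss(reconstruction_loss: float, step: int, total_steps: int) -> float:
    """Loss minimized by the masking module: the MAE reconstruction loss scaled by the factor.

    A positive factor means the module minimizes the same loss as the MAE (partner).
    A negative factor means minimizing this value maximizes the MAE loss (adversary).
    """
    return curriculum_factor(step, total_steps) * reconstruction_loss


def behavior(factor: float, tol: float = 1e-8) -> str:
    """Name the regime implied by the sign of the factor."""
    if factor > tol:
        return "partner"
    if factor < -tol:
        return "adversary"
    return "neutral"


if __name__ == "__main__":
    total = 101  # assumed number of training steps, chosen only for the demo
    for step in (0, 25, 50, 75, 100):
        factor = curriculum_factor(step, total)
        print(step, factor, behavior(factor), masking_module_loss(0.5, step, total))
```

Under these assumptions, the masking module initially produces masks that make reconstruction easier, contributes no gradient signal at the midpoint, and later produces masks that make reconstruction harder, which yields an easy-to-hard ordering. Any other smooth, monotonically decreasing schedule that crosses zero would fit the description in the abstract equally well.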