Masked Image Modeling (MIM) has achieved outstanding success in self-supervised representation learning. Unfortunately, MIM models typically carry a huge computational burden and learn slowly, which is an unavoidable obstacle to their industrial application. Although the lower layers play the key role in MIM, existing MIM models perform the reconstruction task only at the top layer of the encoder. The lower layers are not explicitly guided, and the interactions among their patches are used only to compute new activations. Because the reconstruction task requires non-trivial inter-patch interactions to infer the target signals, we apply it to multiple local layers, including both lower and upper layers. Further, since different layers are expected to learn information at different scales, we design local multi-scale reconstruction, in which the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals, respectively. This design not only accelerates representation learning by explicitly guiding multiple layers, but also facilitates multi-scale semantic understanding of the input. Extensive experiments show that, with a significantly smaller pre-training burden, our model achieves comparable or better performance than existing MIM models on classification, detection, and segmentation tasks.
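The idea of supervising several layers at different scales can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, raw pixel patches stand in for the supervision signals, the per-layer decoders are not modeled (their outputs are simply passed in as `preds`), the mask is assumed to be defined on the coarsest grid, and the per-scale losses are assumed to be summed with equal weight.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W) image into non-overlapping patch x patch blocks.

    Returns an array of shape (num_patches, patch * patch), row-major over the grid.
    """
    h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "image must be divisible by patch size"
    gh, gw = h // patch, w // patch
    blocks = img.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return blocks.reshape(gh * gw, patch * patch)

def expand_mask(coarse_mask, factor):
    """Upsample a boolean (gh, gw) mask so each coarse cell covers factor x factor fine cells."""
    return np.repeat(np.repeat(coarse_mask, factor, axis=0), factor, axis=1)

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only; mask is boolean, shape (N,)."""
    assert mask.any(), "at least one patch must be masked"
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

def local_multi_scale_loss(img, preds, coarse_mask, coarse_patch):
    """Sum masked reconstruction losses over several scales.

    preds maps a patch size p to the (num_patches_p, p * p) output of a
    hypothetical decoder attached to one encoder layer: small p (fine scale)
    for lower layers, large p (coarse scale) for upper layers.
    """
    total = 0.0
    for p, pred in preds.items():
        target = patchify(img, p)  # assumption: raw pixels as the supervision signal
        mask = expand_mask(coarse_mask, coarse_patch // p).reshape(-1)
        total += masked_mse(pred, target, mask)  # assumption: equal weight per scale
    return total
```

For example, with a 16 x 16 image, a 2 x 2 mask on the 8-pixel grid, and predictions at patch sizes 4 (a lower layer) and 8 (an upper layer), the loss is zero when every prediction equals its target and grows as any single scale deviates.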