Current RGB-D scene recognition approaches often train two standalone backbones for the RGB and depth modalities, both initialized from the same Places or ImageNet pre-training. However, a depth network pre-trained this way remains biased toward RGB-based models, which may lead to a suboptimal solution. In this paper, we present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE. CoMAE uses a curriculum learning strategy to unify two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling. Specifically, we first build a patch-level alignment task to pre-train a single encoder shared by the two modalities via cross-modal contrastive learning. The pre-trained contrastive encoder is then passed to a multi-modal masked autoencoder to capture finer context features from a generative perspective. In addition, our single-model design requires no fusion module, which makes it flexible and robust when generalizing to unimodal scenarios in both the training and testing phases. Extensive experiments on the SUN RGB-D and NYUDv2 datasets demonstrate the effectiveness of CoMAE for RGB and depth representation learning. Our experimental results also reveal that CoMAE is a data-efficient representation learner: although we use only the small-scale, unlabeled training set for pre-training, our CoMAE pre-trained models remain competitive with state-of-the-art methods that rely on additional large-scale supervised RGB dataset pre-training. Code will be released at https://github.com/MCG-NJU/CoMAE.
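To make the patch-level alignment task more concrete, the following is a minimal sketch of one way such a cross-modal contrastive objective could be written. It is not taken from the CoMAE code: the symmetric InfoNCE form, the temperature value of 0.07, and the function names (`l2_normalize`, `cross_entropy_diagonal`, `patch_alignment_loss`) are all assumptions made for illustration. The sketch treats the RGB and depth embeddings of the same spatial patch as a positive pair and all other patches of the other modality as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def patch_alignment_loss(rgb_patches, depth_patches, temperature=0.07):
    """Symmetric InfoNCE over the patch embeddings of one RGB-D pair.

    Assumption: rgb_patches[i] and depth_patches[i] embed the same spatial
    location and form the positive pair; every other patch of the other
    modality serves as a negative. The temperature is a hypothetical default.
    """
    if not rgb_patches or len(rgb_patches) != len(depth_patches):
        raise ValueError("expected two non-empty, equally sized patch lists")
    rgb = [l2_normalize(p) for p in rgb_patches]
    depth = [l2_normalize(p) for p in depth_patches]
    # Cosine similarity of every RGB patch with every depth patch.
    logits = [
        [sum(a * b for a, b in zip(r, d)) / temperature for d in depth]
        for r in rgb
    ]
    # Transposing gives the depth-to-RGB direction of the same task.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))
```

In this sketch the loss is close to zero when corresponding patches have matching embeddings and grows when the correspondence is broken. In a real pipeline the embeddings would come from the shared encoder, and the same encoder would then initialize the masked autoencoder stage.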