In this paper, we propose a new pre-training method for image understanding tasks under the Curriculum Learning (CL) paradigm that leverages RGB-D data. The method combines a multi-modal contrastive masked autoencoder with denoising techniques. Recent approaches either use masked autoencoding (e.g., MultiMAE) or contrastive learning (e.g., Pri3D), or combine the two in a single contrastive masked autoencoder architecture such as CMAE and CAV-MAE. However, none of these single contrastive masked autoencoders is applicable to RGB-D datasets. To improve the performance and efficacy of such methods, we propose a new pre-training strategy based on CL. Specifically, in the first stage, we pre-train the model with contrastive learning to learn cross-modal representations. In the second stage, we initialize the modality-specific encoders with the weights from the first stage and then pre-train the model with masked autoencoding and the denoising/noise-prediction objective used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches of the input modality from local spatial correlations, while denoising learns the high-frequency components of the input data. Our approach is scalable, robust, and well suited to pre-training with limited RGB-D data. Extensive experiments on multiple datasets, including ScanNet, NYUv2, and SUN RGB-D, demonstrate the efficacy and superior performance of our approach. In particular, we show an improvement of +1.0% mIoU over Mask3D on ScanNet semantic segmentation. We further demonstrate the effectiveness of our approach in the low-data regime by evaluating it on semantic segmentation against state-of-the-art methods.
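The two-stage curriculum described above can be sketched with toy loss functions; this is a minimal illustrative sketch, not the paper's implementation, and the function names, the symmetric InfoNCE formulation, and the loss weight `lam` are our own assumptions:

```python
import numpy as np

def info_nce(z_rgb, z_depth, tau=0.07):
    """Stage 1 (assumed form): symmetric cross-modal InfoNCE between RGB and depth embeddings."""
    # L2-normalize the embeddings of both modalities
    z_rgb = z_rgb / np.linalg.norm(z_rgb, axis=1, keepdims=True)
    z_depth = z_depth / np.linalg.norm(z_depth, axis=1, keepdims=True)
    # cross-modal similarity matrix; matched RGB-depth pairs lie on the diagonal
    logits = z_rgb @ z_depth.T / tau
    labels = np.arange(len(z_rgb))

    def xent(l):
        # numerically stable softmax cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # symmetrize over RGB->depth and depth->RGB retrieval directions
    return 0.5 * (xent(logits) + xent(logits.T))

def stage2_loss(pred_patches, target_patches, mask, pred_noise, true_noise, lam=0.5):
    """Stage 2 (assumed form): masked-patch reconstruction plus diffusion-style noise prediction."""
    # masked-autoencoding term: MSE computed only on the masked patches
    rec = ((pred_patches - target_patches) ** 2)[mask].mean()
    # denoising term: regress the noise added to the input, as in diffusion models
    den = ((pred_noise - true_noise) ** 2).mean()
    # lam is an illustrative weighting between the two objectives
    return rec + lam * den
```

In this sketch, stage 1 would be optimized first to align the modality-specific encoders, whose weights then initialize the stage 2 model before the combined reconstruction/denoising objective is minimized.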