In this paper, we propose a new pre-training method for image understanding tasks under the Curriculum Learning (CL) paradigm that leverages RGB-D data. The method combines a multi-modal contrastive masked autoencoder with denoising techniques. Recent approaches either use masked autoencoding (e.g., MultiMAE) or contrastive learning (e.g., Pri3D), or combine the two in a single contrastive masked autoencoder architecture such as CMAE and CAV-MAE. However, none of these single contrastive masked autoencoders is applicable to RGB-D datasets. To improve the performance and efficacy of such methods, we propose a new pre-training strategy based on CL. Specifically, in the first stage, we pre-train the model with contrastive learning to learn cross-modal representations. In the second stage, we initialize the modality-specific encoders with the weights from the first stage and then pre-train the model with masked autoencoding and the denoising/noise-prediction objective used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches of the input modality from local spatial correlations, while denoising learns the high-frequency components of the input data. Our approach is scalable, robust, and well suited to pre-training with limited RGB-D data. Extensive experiments on multiple datasets, including ScanNet, NYUv2, and SUN RGB-D, demonstrate the efficacy and superior performance of our approach. In particular, we show an improvement of +1.0% mIoU over Mask3D on ScanNet semantic segmentation. We further demonstrate the effectiveness of our approach in the low-data regime by evaluating it on semantic segmentation against state-of-the-art methods.
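The two-stage curriculum described above can be sketched with toy loss functions; this is a minimal illustrative sketch, not the paper's implementation, and the function names, the symmetric InfoNCE formulation, and the loss weight `lam` are our own assumptions:

```python
import numpy as np

def info_nce(z_rgb, z_depth, tau=0.07):
    """Stage 1 (assumed form): symmetric cross-modal InfoNCE between RGB and depth embeddings."""
    # L2-normalize the embeddings of both modalities
    z_rgb = z_rgb / np.linalg.norm(z_rgb, axis=1, keepdims=True)
    z_depth = z_depth / np.linalg.norm(z_depth, axis=1, keepdims=True)
    # cross-modal similarity matrix; matched RGB-depth pairs lie on the diagonal
    logits = z_rgb @ z_depth.T / tau
    labels = np.arange(len(z_rgb))

    def xent(l):
        # numerically stable softmax cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # symmetrize over RGB->depth and depth->RGB retrieval directions
    return 0.5 * (xent(logits) + xent(logits.T))

def stage2_loss(pred_patches, target_patches, mask, pred_noise, true_noise, lam=0.5):
    """Stage 2 (assumed form): masked-patch reconstruction plus diffusion-style noise prediction."""
    # masked-autoencoding term: MSE computed only on the masked patches
    rec = ((pred_patches - target_patches) ** 2)[mask].mean()
    # denoising term: regress the noise added to the input, as in diffusion models
    den = ((pred_noise - true_noise) ** 2).mean()
    # lam is an illustrative weighting between the two objectives
    return rec + lam * den
```

In this sketch, stage 1 would be optimized first to align the modality-specific encoders, whose weights then initialize the stage 2 model before the combined reconstruction/denoising objective is minimized.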