Masked Autoencoders (MAE) learn strong visual representations and achieve state-of-the-art results in several individual modalities, yet very few works have addressed their capabilities in multi-modal settings. In this work, we focus on point cloud and RGB image data, two modalities that often appear together in the real world, and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction in three ways. First, we observe that the masking strategies applied to the two sources matter, and we use a projection module to complementarily align the masked and visible tokens of the two modalities. Second, we use a two-branch MAE pipeline with a novel shared decoder to promote cross-modal interaction in the mask tokens. Third, we design a cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScanNetV2), we find that learning point-image features interactively is nontrivial, and that PiMAE improves multiple 3D detectors, 2D detectors, and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE.
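The projection-based masking described in the abstract can be illustrated with a minimal sketch. It is not taken from the PiMAE code base: it assumes a pinhole camera with intrinsics `K`, point-token centers already expressed in the camera frame, and a square image-patch grid, and it reads "complementary" as masking the image patches onto which visible point tokens project. The function names `project_to_patches` and `complementary_masks` are hypothetical.

```python
import numpy as np


def project_to_patches(centers, K, image_hw, patch):
    """Map 3D points (camera frame) to flat image-patch indices.

    Returns -1 for points behind the camera or outside the image.
    Assumes image height and width are divisible by `patch`.
    """
    h, w = image_hw
    assert h % patch == 0 and w % patch == 0
    uvw = centers @ K.T                      # (N, 3) homogeneous pixel coords
    z = uvw[:, 2]
    valid = z > 1e-6                         # keep points in front of the camera
    safe_z = np.where(valid, z, 1.0)         # avoid division by zero
    u = uvw[:, 0] / safe_z
    v = uvw[:, 1] / safe_z
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    cols = (u // patch).astype(int)
    rows = (v // patch).astype(int)
    idx = rows * (w // patch) + cols
    return np.where(valid, idx, -1)


def complementary_masks(centers, K, image_hw, patch, point_mask_ratio, rng):
    """Randomly mask point tokens, then mask the image patches that the
    remaining visible point tokens project onto (an assumed interpretation)."""
    n = centers.shape[0]
    n_masked = int(round(point_mask_ratio * n))
    point_masked = np.zeros(n, dtype=bool)
    point_masked[rng.permutation(n)[:n_masked]] = True

    h, w = image_hw
    n_patches = (h // patch) * (w // patch)
    patch_idx = project_to_patches(centers, K, image_hw, patch)

    image_masked = np.zeros(n_patches, dtype=bool)
    visible_hits = patch_idx[(~point_masked) & (patch_idx >= 0)]
    image_masked[visible_hits] = True        # hide what the 3D branch can see
    return point_masked, image_masked
```

Under this reading, the visible tokens of the two branches cover different regions of the scene, so each branch has to rely on the other modality to reconstruct what it cannot see. Note that in this sketch the image mask ratio is a by-product of the projection rather than a fixed hyperparameter.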