UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Masked Autoencoders (MAE) play a pivotal role in learning strong representations and deliver outstanding results across the 3D perception tasks essential for autonomous driving. Real-world driving systems commonly deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful features, existing MAE methods leave a noticeable gap in addressing this integration. This work investigates multi-modal Masked Autoencoders built on a unified representation space for autonomous driving, aiming at a more efficient fusion of two distinct modalities. To combine the semantics inherent in images with the geometric detail of LiDAR point clouds, this work proposes UniM$^2$AE, a simple yet effective multi-modal self-supervised pre-training framework that rests mainly on two designs. First, it projects the features from both modalities into a cohesive 3D volume space, which extends the bird's eye view (BEV) to include the height dimension. This extension makes it possible to back-project the informative features, obtained by fusing features from both modalities, into their native modalities to reconstruct the multiple masked inputs. Second, the Multi-modal 3D Interactive Module (MMIM) facilitates efficient interaction between the two modalities. Extensive experiments on the nuScenes dataset attest to the efficacy of UniM$^2$AE, showing improvements in 3D object detection and BEV map segmentation of 1.2\% (NDS) and 6.5\% (mIoU), respectively. Code is available at https://github.com/hollow-503/UniM2AE.
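To illustrate why extending BEV with a height dimension matters, the following is a minimal NumPy sketch and not the UniM$^2$AE implementation. It scatters a few points into a toy occupancy volume and compares it with the BEV map obtained by collapsing the height axis. The function name `voxelize`, the grid parameters, and the sample points are all hypothetical and chosen only for illustration.

```python
import numpy as np

def voxelize(points, grid_min, voxel_size, grid_shape):
    """Scatter (N, 3) points into a binary occupancy volume of shape (X, Y, Z).

    Hypothetical helper for illustration; points outside the grid are dropped.
    """
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    volume = np.zeros(grid_shape, dtype=np.float32)
    # Advanced indexing: one index array per axis marks the occupied voxels.
    volume[tuple(idx[inside].T)] = 1.0
    return volume

# Two points share the same (x, y) cell but sit at different heights;
# the third point lies outside the grid and is discarded.
points = np.array([
    [1.2, 0.3, 0.1],
    [1.4, 0.1, 2.6],
    [9.0, 9.0, 9.0],
])
volume = voxelize(points, grid_min=(0.0, 0.0, 0.0), voxel_size=1.0,
                  grid_shape=(4, 4, 4))

# Collapsing the height axis yields a BEV map of shape (X, Y).
bev = volume.max(axis=2)

print("occupied voxels in 3D volume:", int(volume.sum()))
print("occupied cells in BEV map:", int(bev.sum()))
```

In this toy case the two stacked points occupy distinct voxels in the 3D volume but merge into a single BEV cell. A representation that keeps the height axis therefore preserves the vertical structure that a back-projection to each sensor's native view would need.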
