Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders

Current perception models in autonomous driving rely heavily on large-scale labelled 3D data, which is both costly and time-consuming to annotate. This work reduces the dependence on labelled 3D training data by pre-training on large-scale unlabelled outdoor LiDAR point clouds with masked autoencoders (MAE). Existing masked point autoencoding methods mainly target small-scale indoor point clouds or pillar-based large-scale outdoor LiDAR data. In contrast, we introduce Occupancy-MAE, a self-supervised masked occupancy pre-training method designed specifically for voxel-based large-scale outdoor LiDAR point clouds. Occupancy-MAE exploits the fact that the voxel occupancy of outdoor LiDAR point clouds becomes gradually sparser with distance, and combines a range-aware random masking strategy with a pretext task of occupancy prediction. By randomly masking voxels according to their distance to the LiDAR and predicting the masked occupancy structure of the entire surrounding 3D scene, Occupancy-MAE encourages the network to extract high-level semantic information, since it must reconstruct the masked voxels from only a small number of visible voxels. Extensive experiments demonstrate the effectiveness of Occupancy-MAE across several downstream tasks. For 3D object detection, Occupancy-MAE halves the labelled data required for car detection on the KITTI dataset and improves small object detection by approximately 2% in AP on the Waymo dataset. For 3D semantic segmentation, Occupancy-MAE outperforms training from scratch by around 2% in mIoU. For multi-object tracking, Occupancy-MAE improves over training from scratch by approximately 1% in both AMOTA and AMOTP. Code is publicly available at https://github.com/chaytonmin/Occupancy-MAE.
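The range-aware masking and the occupancy pretext target described above can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names, the range boundaries (30 m and 50 m), the per-range masking ratios, and the use of a binary cross-entropy loss are all assumptions chosen for illustration. The only ideas taken from the abstract are that the masking probability depends on a voxel's distance to the LiDAR and that the network predicts the occupancy of the full 3D grid.

```python
import numpy as np

def range_aware_mask(voxel_centers, rng, bounds=(30.0, 50.0), ratios=(0.9, 0.7, 0.5)):
    """Return a boolean array that is True where an occupied voxel is hidden.

    bounds and ratios are illustrative assumptions: near voxels (dense) are
    masked more aggressively than far voxels (sparse).
    """
    # Horizontal distance of each voxel centre from the sensor at the origin.
    dist = np.linalg.norm(voxel_centers[:, :2], axis=1)
    # Range bin per voxel: 0 = near, 1 = middle, 2 = far.
    bins = np.digitize(dist, bounds)
    # Look up the masking probability of each voxel's bin and sample.
    prob = np.asarray(ratios)[bins]
    return rng.random(len(voxel_centers)) < prob

def occupancy_target(voxel_indices, grid_shape):
    """Dense binary grid: 1 where a voxel holds at least one point, else 0."""
    occ = np.zeros(grid_shape, dtype=np.float32)
    occ[tuple(voxel_indices.T)] = 1.0
    return occ

def occupancy_bce(logits, target):
    """Numerically stable mean binary cross-entropy on raw logits (assumed loss)."""
    return np.mean(np.maximum(logits, 0.0) - logits * target
                   + np.log1p(np.exp(-np.abs(logits))))
```

In such a setup, the encoder would see only the voxels where the mask is False, while the loss compares the decoder output against the occupancy grid built from all voxels, including the masked ones.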
