Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and, recently, point clouds. Raw automotive datasets are suitable candidates for self-supervised pre-training, as they are generally cheap to collect compared to annotations for tasks like 3D object detection (OD). However, the development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward small and dense point clouds with homogeneous point densities. In this work, we study masked autoencoding for point clouds in an automotive setting, where the point clouds are sparse and the point density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent. Code is available at https://github.com/georghess/voxel-mae
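The data preparation implied by the abstract (voxelize a sparse point cloud, hide a fraction of the non-empty voxels, and build empty versus non-empty labels) can be illustrated with a minimal sketch. This is not the authors' implementation; see the linked repository for that. The function names, the grid extent, the voxel size, and the 0.7 mask ratio are all assumptions chosen for illustration, and the point-reconstruction loss and the Transformer backbone are left out.

```python
import math
import random


def voxelize(points, voxel_size, grid_min, grid_shape):
    """Return the set of (i, j, k) indices of voxels that contain at least one point."""
    occupied = set()
    for p in points:
        idx = tuple(
            math.floor((p[d] - grid_min[d]) / voxel_size[d]) for d in range(3)
        )
        # Discard points that fall outside the grid.
        if all(0 <= idx[d] < grid_shape[d] for d in range(3)):
            occupied.add(idx)
    return occupied


def split_masked(occupied, mask_ratio, rng):
    """Randomly hide a fraction of the non-empty voxels; return (visible, masked)."""
    voxels = sorted(occupied)  # fixed order so the shuffle is reproducible
    rng.shuffle(voxels)
    n_masked = round(mask_ratio * len(voxels))
    return set(voxels[n_masked:]), set(voxels[:n_masked])


def occupancy_labels(occupied, masked, grid_shape, n_empty, rng):
    """Label masked (non-empty) voxels 1 and a random sample of empty voxels 0."""
    gx, gy, gz = grid_shape
    empty = [
        (i, j, k)
        for i in range(gx)
        for j in range(gy)
        for k in range(gz)
        if (i, j, k) not in occupied
    ]
    labels = {v: 1 for v in masked}
    for v in rng.sample(empty, min(n_empty, len(empty))):
        labels[v] = 0
    return labels


# Hypothetical toy scene: 500 random points in a 20 m x 20 m x 4 m volume.
rng = random.Random(0)
points = [
    (rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(-2, 2))
    for _ in range(500)
]
grid_shape = (40, 40, 8)  # assumed grid: 0.5 m voxels
occupied = voxelize(points, (0.5, 0.5, 0.5), (-10.0, -10.0, -2.0), grid_shape)
visible, masked = split_masked(occupied, 0.7, rng)  # 0.7 is an assumed ratio
labels = occupancy_labels(occupied, masked, grid_shape, len(masked), rng)
```

In a setup like this, an encoder would see only `visible`, while `labels` would supervise a binary occupancy head; balancing the number of sampled empty voxels against the masked ones is one possible design choice, not something the abstract specifies.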