Masked Autoencoder for Self-Supervised Pre-training on Lidar Point Clouds

Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and, recently, point clouds. Raw automotive datasets are suitable candidates for self-supervised pre-training, as they are generally cheap to collect compared to annotations for tasks like 3D object detection (OD). However, the development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward small and dense point clouds with homogeneous point densities. In this work, we study masked autoencoding for automotive point clouds, which are sparse and whose point density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent. Code is available at https://github.com/georghess/voxel-mae
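
To make the pre-training setup in the abstract more concrete, the following is a minimal sketch of the data-side steps only: grouping points into voxels, hiding a random subset of the non-empty voxels, and building empty versus non-empty labels. It is not taken from the authors' code. The function names, the voxel size, and the mask ratio are all assumptions chosen for illustration, and the Transformer backbone and reconstruction losses are omitted.

```python
import random
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group (x, y, z) points by the integer index of the voxel containing them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)


def mask_voxels(voxels, mask_ratio, rng):
    """Randomly split the non-empty voxels into a visible dict and a masked key set.

    mask_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    keys = sorted(voxels)
    n_masked = int(round(mask_ratio * len(keys)))
    masked = set(rng.sample(keys, n_masked))
    visible = {k: v for k, v in voxels.items() if k not in masked}
    return visible, masked


def occupancy_targets(candidate_keys, voxels):
    """Binary labels: 1 if the voxel holds points in the original scan, else 0."""
    return {k: int(k in voxels) for k in candidate_keys}


if __name__ == "__main__":
    scan = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (1.5, 0.2, 0.1), (2.5, 2.5, 0.0)]
    all_voxels = voxelize(scan, voxel_size=1.0)
    visible, masked = mask_voxels(all_voxels, mask_ratio=0.5, rng=random.Random(0))
    # An encoder would see only `visible`; a decoder would be trained to
    # reconstruct the points in `masked` and to predict occupancy labels.
    print(len(visible), "visible voxels,", len(masked), "masked voxels")
```

In this sketch the encoder would receive only the visible voxels, while the masked keys (plus some truly empty voxel positions) would serve as targets for reconstruction and for the empty versus non-empty classification described above.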

