RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques that use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs, but recent advances in computer vision show that vision transformers (ViTs) achieve state-of-the-art results on many image-based benchmarks. In this work, we ask whether projection-based methods for 3D semantic segmentation can benefit from these latest improvements in ViTs. We answer positively, but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections, which are much cheaper to acquire and annotate than point clouds. We reach our best results with ViTs pre-trained on large image datasets. (b) We compensate for ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem, combining the low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. The code is available at https://github.com/valeoai/rangevit.
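
To make the range projection mentioned above concrete, here is a minimal sketch of the standard spherical projection that maps each 3D point to a pixel of a 2D range image. It is an illustration only, not the RangeViT implementation: the function name `range_project`, the default image size, and the default vertical field of view are assumptions chosen for the example and are not taken from the abstract.

```python
import math


def range_project(points, height=64, width=2048,
                  fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) LiDAR points onto a height x width range image.

    Returns a 2D list of ranges in meters; pixels hit by no point hold None.
    The default size and field of view are illustrative assumptions.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down  # total vertical field of view in radians

    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction

        yaw = math.atan2(y, x)    # azimuth in [-pi, pi]
        pitch = math.asin(z / r)  # elevation in [-pi/2, pi/2]

        # Normalize both angles to [0, 1]; row 0 is the top of the image.
        u = 0.5 * (1.0 - yaw / math.pi)
        v = (fov_up - pitch) / fov

        # Points outside the field of view are clamped to the border pixels.
        col = min(width - 1, max(0, int(u * width)))
        row = min(height - 1, max(0, int(v * height)))

        # When several points fall into one pixel, keep the closest one.
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image
```

The resulting image can be processed like an ordinary 2D input, which is what allows an image backbone to be reused. In practice each pixel usually carries more channels than the range alone (for example the point coordinates and intensity), and per-pixel predictions must be mapped back to the original 3D points afterwards.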
