A recent trend in LiDAR perception is to unify multiple tasks in a single strong network, rather than training a separate network for each task, in order to improve performance. In this paper, we introduce a new LiDAR multi-task learning paradigm based on the transformer. The proposed LiDARFormer uses cross-space global contextual features and exploits cross-task synergy to boost the performance of LiDAR perception tasks across multiple large-scale datasets and benchmarks. Our transformer-based framework includes a cross-space transformer module that learns attentive features between the 2D dense Bird's Eye View (BEV) feature map and the 3D sparse voxel feature map. Additionally, we propose a transformer decoder for the segmentation task that dynamically adjusts the learned features by leveraging categorical feature representations. Furthermore, we combine the segmentation and detection features in a shared transformer decoder with cross-task attention layers to enhance and integrate the object-level and class-level features. LiDARFormer is evaluated on the large-scale nuScenes and Waymo Open datasets for both 3D detection and semantic segmentation, and it outperforms all previously published methods on both tasks. Notably, LiDARFormer achieves state-of-the-art performance for a single-model, LiDAR-only method, with 76.4% L2 mAPH on the challenging Waymo detection benchmark and 74.3% NDS on the nuScenes detection benchmark.
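To make the cross-space idea in the abstract concrete, the following is a minimal sketch of scaled dot-product cross-attention in which sparse voxel features act as queries and flattened BEV cell features act as keys and values. It is not the authors' implementation. The function names, the toy feature values, the attention direction (voxels attending to BEV), the residual fusion, and the absence of learned projections, multiple heads, and positional encodings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query attends over all keys and returns a weighted
    average of the corresponding values.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Hypothetical toy inputs (assumed values, 2 channels per feature).
voxel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # 3 non-empty voxels
bev_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]  # flattened 2x2 BEV grid

# Voxel features query the BEV map; keys and values are both BEV features.
attended = cross_attention(voxel_feats, bev_feats, bev_feats)

# Assumed residual fusion: add the BEV context back onto each voxel feature.
fused = [[a + b for a, b in zip(v, c)] for v, c in zip(voxel_feats, attended)]
```

Because every voxel attends over the whole BEV grid, each fused voxel feature carries global scene context regardless of how sparse the voxel set is. A practical model would add learned query, key, and value projections and would typically restrict or downsample the attended set to keep the cost manageable.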