The integration of data from diverse sensor modalities (e.g., camera and LiDAR) is a prevalent approach in autonomous driving. Recent advances in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. Fusion, however, is complicated by the fact that image patches are dense in pixel space and carry ambiguous depth, which necessitates additional design considerations. In this paper, we conduct a comprehensive exploration of design choices for Transformer-based sparse camera-LiDAR fusion, covering strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single-modal tokenizers, and the micro-structure of the Transformer. By combining the most effective principles uncovered in our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set at 10.1 FPS with PyTorch.
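To make the sparse fusion pattern concrete, the following is a minimal PyTorch sketch of one design point in the explored space: camera tokens lifted to 3D (e.g., via estimated depth) and LiDAR tokens are pooled into a single sparse set, grouped into BEV windows, and attended within each group. All class, parameter, and tensor names here are hypothetical illustrations, not FlatFusion's actual implementation.

```python
# Illustrative sketch (hypothetical names): window-grouped attention over a
# mixed set of sparse camera and LiDAR tokens sharing 3D positions.
import torch
import torch.nn as nn


class SparseFusionBlock(nn.Module):
    """Assigns tokens to BEV windows and applies self-attention per window,
    a common neighbor-grouping pattern for sparse camera-LiDAR fusion."""

    def __init__(self, dim: int = 128, num_heads: int = 4, window: float = 4.0):
        super().__init__()
        self.window = window  # BEV window size in meters (assumed)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # tokens: (N, C) camera tokens (lifted to 3D) plus LiDAR voxel tokens.
        # xyz: (N, 3) their 3D positions in a shared coordinate frame.
        win_id = (xyz[:, :2] / self.window).floor().long()
        key = win_id[:, 0] * 100000 + win_id[:, 1]  # hash the (x, y) window index
        out = tokens.clone()
        for k in key.unique():  # attention restricted to tokens in one window
            idx = (key == k).nonzero(as_tuple=True)[0]
            grp = tokens[idx].unsqueeze(0)          # (1, n_k, C)
            attn_out, _ = self.attn(grp, grp, grp)  # local self-attention
            out[idx] = self.norm(tokens[idx] + attn_out.squeeze(0))
        return out


# Toy usage: 500 mixed sparse tokens with random positions in a 50 m scene.
tokens = torch.randn(500, 128)
xyz = torch.rand(500, 3) * 50.0
fused = SparseFusionBlock()(tokens, xyz)
print(fused.shape)  # torch.Size([500, 128])
```

The per-window Python loop is for clarity only; efficient sparse transformers batch equal-sized groups instead, which is one of the micro-structure choices the paper examines.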