Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, which incurs additional computational overhead and makes collaboration between different sensor data inefficient. In this paper, we present UniTR, an efficient multi-modal backbone for outdoor 3D perception that processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder that handles these view-discrepant sensor data, enabling parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy that considers both semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, improving NDS by 1.1 for 3D object detection and mIoU by 12.0 for BEV map segmentation, with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR.
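To make the idea of shared parameters across modalities concrete, the following is a minimal sketch and not the UniTR implementation. It assumes camera and LiDAR inputs have already been turned into token lists of equal feature width, and it uses plain single-head attention over all tokens; the 2D perspective and 3D sparse neighborhood partitioning is not modeled. The names `make_weights`, `attention_block`, and `shared_encoder` are hypothetical, and reusing one weight set for both the modal-wise and the cross-modal step is a simplification chosen for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def make_weights(dim, seed=0):
    """One set of Q/K/V projections, reused for every modality (hypothetical init)."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    return {"q": mat(), "k": mat(), "v": mat()}


def attention_block(tokens, w):
    """Single-head self-attention with a residual connection over one token set."""
    q = matmul(tokens, w["q"])
    k = matmul(tokens, w["k"])
    v = matmul(tokens, w["v"])
    scale = math.sqrt(len(tokens[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    out = matmul(scores, v)
    return [[t + o for t, o in zip(t_row, o_row)] for t_row, o_row in zip(tokens, out)]


def shared_encoder(camera_tokens, lidar_tokens, w):
    """Apply the same weights per modality, then across both modalities."""
    # Modal-wise step: each modality attends only to its own tokens.
    cam = attention_block(camera_tokens, w)
    lid = attention_block(lidar_tokens, w)
    # Cross-modal step: attention over the concatenated token set.
    fused = attention_block(cam + lid, w)
    return fused[:len(cam)], fused[len(cam):]


if __name__ == "__main__":
    rng = random.Random(1)
    camera = [[rng.random() for _ in range(4)] for _ in range(3)]  # 3 toy image tokens
    lidar = [[rng.random() for _ in range(4)] for _ in range(5)]   # 5 toy voxel tokens
    cam_out, lidar_out = shared_encoder(camera, lidar, make_weights(4))
    print(len(cam_out), len(lidar_out))  # prints "3 5"
```

The point of the sketch is only that no modality-specific branch or separate fusion module appears: both token sets pass through the same parameters, and cross-modal interaction comes from attending over their union.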