Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance under challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our auxiliary depth task, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DeLiVER datasets. Code and models are available at https://github.com/timbroed/DGFusion.
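The abstract describes local depth tokens and a global condition token that adapt fusion to each sensor's spatially varying reliability. DGFusion's actual fusion is attentive; as a deliberately simplified stand-in, a per-location sigmoid gate derived from hypothetical depth tokens can illustrate the core idea of spatially varying modality weighting (all names and the gating form here are illustrative assumptions, not DGFusion's API):

```python
import numpy as np


def depth_conditioned_fusion(cam_feat, lidar_feat, depth_tokens, global_token):
    """Toy spatially varying fusion (illustrative, not the paper's method).

    cam_feat, lidar_feat: (N, C) per-location features from each modality.
    depth_tokens:         (N,)  hypothetical local depth-derived scores.
    global_token:         scalar hypothetical global condition score.

    A sigmoid gate per location decides how much to trust the lidar
    branch; locations where depth cues favor lidar lean on lidar
    features, others lean on camera features.
    """
    gate = 1.0 / (1.0 + np.exp(-(depth_tokens + global_token)))  # in (0, 1)
    return gate[:, None] * lidar_feat + (1.0 - gate[:, None]) * cam_feat
```

With strongly positive tokens the output follows the lidar features; with strongly negative tokens it follows the camera features, mimicking how conditioning tokens could redistribute trust across the scene.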
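The abstract also mentions a robust loss for the auxiliary depth task, needed because lidar supervision is sparse and noisy in adverse conditions. The exact loss is not specified in the abstract; a masked Huber-style (smooth-L1) loss over valid lidar returns is one common robust choice and sketches the idea (the Huber form, the `delta` value, and the validity-mask scheme are assumptions):

```python
import numpy as np


def robust_depth_loss(pred, target, valid_mask, delta=1.0):
    """Masked Huber loss over sparse lidar depth (illustrative sketch).

    Only pixels with a valid lidar return contribute; the Huber form is
    quadratic for small residuals and linear beyond `delta`, capping the
    influence of noisy returns (e.g. rain or snow clutter).
    """
    err = np.abs(pred - target)
    loss = np.where(err <= delta, 0.5 * err**2 / delta, err - 0.5 * delta)
    valid = valid_mask.astype(bool)
    if not valid.any():  # no lidar returns in this batch: nothing to supervise
        return 0.0
    return float(loss[valid].mean())
```

Averaging only over valid pixels keeps the supervision signal well scaled regardless of how sparse the lidar returns are in a given frame.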