Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. In contrast, we propose a novel condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify the environmental condition and generates a Condition Token that guides the fusion of multiple sensor modalities. We further introduce modality-specific feature adapters that align diverse sensor inputs into a shared latent space, enabling efficient integration with a single shared pre-trained backbone. By dynamically adapting sensor fusion to the actual condition, our model significantly improves robustness and accuracy, especially in adverse-condition scenarios. CAFuser sets the new state of the art on the MUSES dataset, with 59.7 PQ for multimodal panoptic segmentation and 78.2 mIoU for semantic segmentation, and ranks first on the public benchmarks.
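The core idea of condition-aware fusion can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual architecture: the function names (`condition_token`, `condition_aware_fusion`), the tanh/softmax choices, and the per-modality scalar gating are all hypothetical stand-ins for CAFuser's learned components.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def condition_token(rgb_features, w_cls):
    # Hypothetical condition classifier head: maps pooled RGB features
    # to a condition embedding (standing in for the "Condition Token").
    return np.tanh(w_cls @ rgb_features)

def condition_aware_fusion(modality_features, token, w_gate):
    # Per-modality gates predicted from the Condition Token; under
    # adverse conditions such gates can down-weight unreliable sensors.
    gates = softmax(w_gate @ token)           # one scalar weight per modality
    stacked = np.stack(modality_features)     # (num_modalities, feat_dim)
    return (gates[:, None] * stacked).sum(axis=0)

rng = np.random.default_rng(0)
feat_dim, num_mod = 8, 3
rgb = rng.standard_normal(feat_dim)
feats = [rng.standard_normal(feat_dim) for _ in range(num_mod)]
tok = condition_token(rgb, rng.standard_normal((feat_dim, feat_dim)))
fused = condition_aware_fusion(feats, tok, rng.standard_normal((num_mod, feat_dim)))
print(fused.shape)  # (8,)
```

In this sketch, the gating is a convex combination over whole modalities; the actual model instead uses the token to guide fusion inside a shared pre-trained backbone via modality-specific feature adapters.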