UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities

Multi-sensor object detection is an active research topic in automated driving, but the robustness of such detection models against missing sensor input (missing modalities), e.g., due to a sudden sensor failure, is a critical problem that remains under-studied. In this work, we propose UniBEV, an end-to-end multi-modal 3D object detection framework designed for robustness against missing modalities: UniBEV can operate on LiDAR plus camera input, but also on LiDAR-only or camera-only input without retraining. To help its detector head handle different input combinations, UniBEV aims to create well-aligned Bird's Eye View (BEV) feature maps from each available modality. Unlike prior BEV-based multi-modal detection methods, UniBEV applies a uniform approach across all sensor modalities to resample features from the native sensor coordinate systems into BEV features. We furthermore investigate the robustness of various fusion strategies w.r.t. missing modalities: the commonly used feature concatenation, but also channel-wise averaging, and a generalization to weighted averaging termed Channel Normalized Weights. To validate its effectiveness, we compare UniBEV to the state-of-the-art BEVFusion and MetaBEV on nuScenes over all sensor input combinations. In this setting, UniBEV achieves $52.5\%$ mAP on average over all input combinations, significantly improving over the baselines ($43.5\%$ mAP on average for BEVFusion, $48.7\%$ mAP on average for MetaBEV). An ablation study shows the robustness benefits of fusing by weighted averaging over regular concatenation, and of sharing queries between the BEV encoders of each modality. Our code will be released upon paper acceptance.
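
The three fusion strategies named above can be illustrated with a minimal sketch; this is not the released UniBEV code. It operates on the channel vector of a single BEV cell, and every name in it (`fuse_concat`, `fuse_average`, `fuse_cnw`, the `logits` argument) is hypothetical. Two points are assumptions that the abstract does not confirm: that a missing modality is zero-filled under concatenation, and that Channel Normalized Weights are per-channel learnable logits normalized by a softmax over the modalities that are present.

```python
import math
from typing import Dict, List, Optional

Features = List[float]  # channel vector of one BEV cell (hypothetical layout)


def fuse_concat(lidar: Optional[Features], camera: Optional[Features],
                channels: int) -> Features:
    """Concatenate along channels; a missing modality is zero-filled (assumption)."""
    zeros = [0.0] * channels
    return (lidar if lidar is not None else zeros) + \
           (camera if camera is not None else zeros)


def fuse_average(feats: Dict[str, Optional[Features]]) -> Features:
    """Channel-wise mean over the modalities that are actually present."""
    available = [f for f in feats.values() if f is not None]
    if not available:
        raise ValueError("at least one modality is required")
    return [sum(vals) / len(available) for vals in zip(*available)]


def fuse_cnw(feats: Dict[str, Optional[Features]],
             logits: Dict[str, List[float]]) -> Features:
    """Weighted channel-wise average.

    Assumption: each modality has one learnable logit per channel, and the
    weights come from a softmax over the modalities that are present.
    """
    names = [n for n, f in feats.items() if f is not None]
    if not names:
        raise ValueError("at least one modality is required")
    fused = []
    for c in range(len(feats[names[0]])):
        exps = {n: math.exp(logits[n][c]) for n in names}
        total = sum(exps.values())
        fused.append(sum(exps[n] / total * feats[n][c] for n in names))
    return fused


lidar, camera = [1.0, 2.0], [3.0, 6.0]
equal_logits = {"lidar": [0.0, 0.0], "camera": [0.0, 0.0]}
both = {"lidar": lidar, "camera": camera}
lidar_only = {"lidar": lidar, "camera": None}

print(fuse_average(both))                  # plain channel-wise mean
print(fuse_cnw(both, equal_logits))        # equal logits give the same mean
print(fuse_cnw(lidar_only, equal_logits))  # a single modality gets weight 1
print(fuse_concat(lidar, None, channels=2))
```

In this sketch the averaging variants keep the fused channel count fixed whichever sensors are present, whereas concatenation has to fill the missing half with a placeholder, which is one plausible reason a detector head would find averaged features easier to handle when a sensor drops out.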
