Multi-modal 3D object detectors aim to provide safe and reliable perception for autonomous driving (AD). Although they achieve state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. The emergence of visual foundation models (VFMs) presents both opportunities and challenges for improving the robustness and generalization of multi-modal 3D object detection in AD. We therefore propose RoboFusion, a robust framework that leverages VFMs such as SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM to AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN, which upsamples the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images, further reducing noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated on the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.
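To make the wavelet-denoising step concrete, the following is a minimal sketch of the general technique, not the RoboFusion implementation. The abstract does not state which wavelet basis, decomposition depth, or thresholding rule is used, so the single-level Haar transform and soft thresholding here are assumptions, and the function names (`haar_forward`, `haar_inverse`, `soft_threshold`, `haar_denoise`) are hypothetical. For brevity it operates on a 1D signal rather than on depth-guided image features.

```python
import math


def haar_forward(signal):
    """One level of the orthonormal Haar transform on an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Approximation coefficients: scaled pairwise sums (low-frequency content).
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # Detail coefficients: scaled pairwise differences (high-frequency content).
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold`, zeroing small ones."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def haar_denoise(signal, threshold):
    """Assumed scheme: shrink only the detail band, keep the approximation band."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))
```

For example, `haar_denoise([4.0, 4.2, 8.0, 7.9], threshold=0.5)` returns approximately `[4.1, 4.1, 7.95, 7.95]`: the small within-pair fluctuations fall below the threshold and are removed, while the large step between the two pairs lives in the approximation band and is preserved. With `threshold=0.0` the input is reconstructed unchanged up to floating-point error.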