3D object detection is an essential perception task in autonomous driving for understanding the environment. Bird's-Eye-View (BEV) representations have significantly improved the performance of camera-based 3D detectors on popular benchmarks. However, a systematic understanding of the robustness of these vision-dependent BEV models is still lacking, even though it is closely tied to the safety of autonomous driving systems. In this paper, we evaluate the natural and adversarial robustness of various representative models under extensive settings, to fully understand how explicit BEV features influence their behavior compared with models without BEV. In addition to the classic settings, we propose a 3D consistent patch attack that applies adversarial patches in 3D space to guarantee spatiotemporal consistency, which is more realistic for autonomous driving scenarios. From substantial experiments, we draw several findings: 1) BEV models tend to be more stable than previous methods under different natural conditions and common corruptions, owing to their expressive spatial representations; 2) BEV models are more vulnerable to adversarial noise, mainly because of redundant BEV features; 3) camera-LiDAR fusion models achieve superior performance under different settings with multi-modal inputs, but the BEV fusion model remains vulnerable to adversarial noise on both point clouds and images. These findings highlight safety issues in the application of BEV detectors and could facilitate the development of more robust models.