Camera-based Bird's Eye View (BEV) perception models are receiving increasing attention for their crucial role in autonomous driving, a domain where concerns about the robustness and reliability of deep learning have been raised. Only a few works have investigated the effects of randomly generated semantic perturbations, also known as natural corruptions, on the multi-view BEV detection task. We develop a black-box robustness evaluation framework that adversarially optimises three common semantic perturbations (geometric transformation, colour shifting, and motion blur) to deceive BEV models, the first approach of its kind in this emerging field. To address the challenge of optimising semantic perturbations, we design a smoothed, distance-based surrogate function to replace the mAP metric and introduce SimpleDIRECT, a deterministic optimisation algorithm that uses observed slopes to guide the search. Comparisons with randomised perturbation and two optimisation baselines demonstrate the effectiveness of the proposed framework. We also provide a benchmark of the semantic robustness of ten recent BEV models. The results reveal that PolarFormer, which emphasises geometric information from multi-view images, exhibits the highest robustness, whereas BEVDet is fully compromised, with its precision reduced to zero.
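The abstract does not give the form of the surrogate function, so the following is only a minimal sketch of what a smoothed, distance-based score could look like, not the authors' definition. The motivation is that a metric built on hard distance thresholds changes only when a detection crosses a threshold, which leaves a black-box optimiser with little signal in between. All names here (`hard_match_score`, `smooth_match_score`, `threshold`, `temperature`) and the default values are hypothetical, and object centres are assumed to be 2D points in the BEV plane.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_match_score(pred_centres, pred_scores, gt_centres, threshold=2.0):
    """Fraction of ground-truth centres with a prediction within `threshold`.

    This is a thresholded stand-in (an assumption, not mAP itself): it is
    flat wherever no prediction crosses the threshold. `pred_scores` is
    accepted only to mirror the smooth variant's signature.
    """
    if not gt_centres:
        return 0.0
    hits = 0
    for gx, gy in gt_centres:
        if any(math.hypot(px - gx, py - gy) <= threshold
               for px, py in pred_centres):
            hits += 1
    return hits / len(gt_centres)


def smooth_match_score(pred_centres, pred_scores, gt_centres,
                       threshold=2.0, temperature=0.25):
    """Soft, distance-based score in [0, 1] (hypothetical surrogate).

    For each ground-truth centre, take the best confidence-weighted soft
    match: score * sigmoid((threshold - distance) / temperature).
    """
    if not gt_centres:
        return 0.0
    total = 0.0
    for gx, gy in gt_centres:
        best = 0.0
        for (px, py), score in zip(pred_centres, pred_scores):
            distance = math.hypot(px - gx, py - gy)
            soft_hit = _sigmoid((threshold - distance) / temperature)
            best = max(best, score * soft_hit)
        total += best
    return total / len(gt_centres)


if __name__ == "__main__":
    gt = [(0.0, 0.0)]
    # Two predictions that both miss the hard threshold by different margins.
    for offset in (2.5, 3.0):
        pred = [(offset, 0.0)]
        print(offset,
              hard_match_score(pred, [1.0], gt),
              smooth_match_score(pred, [1.0], gt))
```

In this sketch the hard score is identical for both offsets, while the smooth score still decreases as the prediction drifts further from the ground truth, which is the kind of graded signal an attacker minimising detection quality can follow.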