SGV3D: Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Roadside perception can greatly improve the safety of autonomous vehicles by extending their perception beyond the visual range and covering blind spots. However, current state-of-the-art vision-based roadside detection methods achieve high accuracy on labeled scenes but perform poorly on new scenes. The reason is that roadside cameras remain stationary after installation and can only collect data from a single scene, so the algorithms overfit to the backgrounds and camera poses of those scenes. To address this issue, we propose a Scenario Generalization Framework for Vision-based Roadside 3D Object Detection, dubbed SGV3D. Specifically, we employ a Background-suppressed Module (BSM) to mitigate background overfitting in vision-centric pipelines by attenuating background features during the 2D to bird's-eye-view projection. Furthermore, we introduce a Semi-supervised Data Generation Pipeline (SSDG) that uses unlabeled images from new scenes to generate diverse instance foregrounds with varying camera poses, reducing the risk of overfitting to specific camera poses. We evaluate our method on two large-scale roadside benchmarks. In new scenes, our method surpasses all previous methods by a significant margin, including +42.57% for vehicle, +5.87% for pedestrian, and +14.89% for cyclist compared to BEVHeight on the DAIR-V2X-I heterologous benchmark. On the larger-scale Rope3D heterologous benchmark, we achieve gains of +14.48% for car and +12.41% for large vehicle. We hope to contribute insights into roadside perception techniques, with an emphasis on their capability for scenario generalization. The code will be available at https://github.com/yanglei18/SGV3D.
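
To make the background-suppression idea concrete, the following is a minimal sketch of attenuating image features with a foreground probability before they are accumulated into a bird's-eye-view (BEV) grid. It is not the SGV3D implementation: the function names `suppress_background` and `splat_to_bev`, the sigmoid foreground mask, the `floor` parameter, and the precomputed pixel-to-cell index `bev_index` are all illustrative assumptions.

```python
import numpy as np


def suppress_background(features, fg_logits, floor=0.0):
    """Scale per-pixel features by a foreground probability (hypothetical helper).

    features:  (C, H, W) image feature map.
    fg_logits: (H, W) foreground logits, assumed to come from some mask head.
    floor:     minimum weight kept for background pixels (assumed knob).
    """
    fg_prob = 1.0 / (1.0 + np.exp(-fg_logits))   # sigmoid -> values in (0, 1)
    weight = floor + (1.0 - floor) * fg_prob     # background pixels -> ~floor
    return features * weight[None, :, :]         # broadcast over channels


def splat_to_bev(features, bev_index, num_cells):
    """Sum pixel features into BEV cells given a precomputed index (assumed).

    bev_index: (H, W) integer array mapping each pixel to a flat BEV cell id.
    Returns a (C, num_cells) array.
    """
    channels = features.shape[0]
    bev_t = np.zeros((num_cells, channels), dtype=features.dtype)
    # Unbuffered accumulation so pixels sharing a cell are all summed.
    np.add.at(bev_t, bev_index.ravel(), features.reshape(channels, -1).T)
    return bev_t.T


# Toy usage: a 2x2 feature map where only the top-left pixel is foreground.
feats = np.ones((2, 2, 2), dtype=np.float64)
logits = np.array([[20.0, -20.0], [-20.0, -20.0]])
index = np.array([[0, 0], [1, 1]])               # top row -> cell 0, bottom -> cell 1

bev_plain = splat_to_bev(feats, index, num_cells=2)
bev_suppressed = splat_to_bev(suppress_background(feats, logits), index, num_cells=2)
```

In this toy case the unsuppressed BEV grid receives equal contributions from every pixel, whereas the suppressed grid keeps essentially only the foreground pixel's contribution, so scene-specific background carries little weight in the BEV representation.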
