Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Roadside perception can greatly increase the safety of autonomous vehicles by extending their perception beyond the visual range and covering blind spots. However, current state-of-the-art vision-based roadside detection methods achieve high accuracy on labeled scenes but perform poorly on new scenes. This is because roadside cameras remain stationary after installation and can only collect data from a single scene, so the algorithms overfit to these roadside backgrounds and camera poses. To address this issue, we propose a Scenario Generalization Framework for Vision-based Roadside 3D Object Detection, dubbed SGV3D. Specifically, we employ a Background-suppressed Module (BSM) to mitigate background overfitting in vision-centric pipelines by attenuating background features during the 2D to bird's-eye-view projection. Furthermore, we introduce a Semi-supervised Data Generation Pipeline (SSDG) that uses unlabeled images from new scenes to generate diverse instance foregrounds with varying camera poses, reducing the risk of overfitting to specific camera poses. We evaluate our method on two large-scale roadside benchmarks. Our method surpasses all previous methods by a significant margin in new scenes, including +42.57% for vehicle, +5.87% for pedestrian, and +14.89% for cyclist compared to BEVHeight on the DAIR-V2X-I heterologous benchmark. On the larger-scale Rope3D heterologous benchmark, we achieve notable gains of +14.48% for car and +12.41% for large vehicle. We hope to contribute insights into roadside perception techniques, with an emphasis on their capability for scenario generalization. The code will be available at https://github.com/yanglei18/SGV3D.
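To make the idea of attenuating background features concrete, the sketch below scales each pixel of a 2D feature map by a foreground weight before any bird's-eye-view projection would take place. It is only an illustration under stated assumptions: the abstract does not say how BSM obtains its foreground estimate or how strongly it attenuates, so the function `suppress_background`, the per-pixel probability map `fg_prob`, and the `floor` parameter are all hypothetical names introduced here, not part of SGV3D.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]
Mask = List[List[float]]              # indexed as [row][col], values in [0, 1]


def suppress_background(features: FeatureMap, fg_prob: Mask,
                        floor: float = 0.0) -> FeatureMap:
    """Scale every channel by a per-pixel foreground weight.

    Hypothetical illustration, not the SGV3D implementation.
    The weight is floor + (1 - floor) * p, so a pixel with foreground
    probability p = 1 is kept unchanged and a pixel with p = 0 is
    scaled down to `floor` times its original value.
    """
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must lie in [0, 1]")
    height = len(fg_prob)
    width = len(fg_prob[0]) if height else 0
    suppressed = []
    for channel in features:
        # Every channel must share the spatial shape of the mask.
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("feature map and mask shapes differ")
        suppressed.append([
            [value * (floor + (1.0 - floor) * p)
             for value, p in zip(feat_row, mask_row)]
            for feat_row, mask_row in zip(channel, fg_prob)
        ])
    return suppressed
```

With `floor=0.0` background pixels are zeroed out entirely; a nonzero `floor` keeps some background context, which may or may not match the design choice made in the actual module.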
