CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity

Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems: it extends the perception range beyond the limits of vision-centric vehicles and enhances road safety. Previous studies are limited by using only depth or only height information; we find that both matter and that they are in fact complementary. The depth feature encompasses precise geometric cues, whereas the height feature primarily distinguishes between categories of height intervals, essentially providing semantic context. This insight motivates Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel's depth and height distributions and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also seamlessly integrated to further enhance detection accuracy using the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public roadside camera-based 3D detection benchmarks DAIR-V2X-I and Rope3D, as well as on the private Supremind-Road dataset, demonstrating that CoBEV not only achieves new state-of-the-art accuracy, but also significantly improves on the robustness of previous methods in challenging long-distance scenarios and under noisy camera disturbance, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in easy mode. The source code will be made publicly available at https://github.com/MasterHow/CoBEV.
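To make the lifting step concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each pixel's depth (or height) distribution is a softmax over a fixed set of bins, and that lifting is an outer product of the image feature with that distribution. The function names `lift` and `gated_fusion`, the tensor shapes, and the bin counts are all hypothetical. The sigmoid gate is only a simple stand-in for the two-stage CFS module, whose actual design the abstract does not specify. The projection of the lifted volumes onto a common BEV grid, which requires camera geometry, is omitted.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def lift(features, logits):
    """Lift image features along a per-pixel bin distribution.

    features: (C, H, W) image features.
    logits:   (B, H, W) unnormalized scores over B depth or height bins.
    Returns a (C, B, H, W) volume in which every pixel's feature is
    spread over the bins in proportion to its predicted probability.
    """
    prob = softmax(logits, axis=0)  # (B, H, W), sums to 1 over bins
    return features[:, None, :, :] * prob[None, :, :, :]


def gated_fusion(depth_bev, height_bev, gate_logits):
    """Blend two BEV feature maps with a per-cell gate.

    ASSUMPTION: a plain sigmoid gate used as a placeholder for CFS.
    depth_bev, height_bev: (C, X, Y); gate_logits: (X, Y).
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return gate[None] * depth_bev + (1.0 - gate[None]) * height_bev


# Hypothetical sizes: 4 channels, a 3x5 feature map, 8 depth and 6 height bins.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3, 5))
depth_volume = lift(feats, rng.normal(size=(8, 3, 5)))   # shape (4, 8, 3, 5)
height_volume = lift(feats, rng.normal(size=(6, 3, 5)))  # shape (4, 6, 3, 5)
```

Because the bin probabilities sum to one at every pixel, summing a lifted volume over its bin axis recovers the original features. The two volumes differ only in how the same features are distributed in space, which is one way to read the claim that depth and height are complementary.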
