CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity

Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems: it extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. Previous studies are limited by relying on either depth or height information alone; we find that both matter and that they are in fact complementary. The depth feature encodes precise geometric cues, whereas the height feature primarily distinguishes between categories of height intervals, essentially providing semantic context. This insight motivates Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel's depth and height distribution and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also integrated to further improve detection accuracy using the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public roadside camera-based 3D detection benchmarks DAIR-V2X-I and Rope3D, as well as the private Supremind-Road dataset. The results demonstrate that CoBEV not only achieves new state-of-the-art accuracy, but also significantly improves on the robustness of previous methods in challenging long-distance scenarios and under noisy camera disturbance, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in easy mode. The source code will be made publicly available at https://github.com/MasterHow/CoBEV.
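
As a rough illustration of the step "estimates each pixel's depth and height distribution and lifts the camera features into 3D space", the sketch below turns per-pixel logits into a categorical distribution over bins and weights the image features by it. This is not the CoBEV implementation: the outer-product lifting is an assumption borrowed from common lift-based BEV pipelines, the names `lift`, `depth_logits`, and `height_logits` and all tensor sizes are hypothetical, and the projection into the BEV grid and the CFS fusion module are not shown.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift(features, logits):
    """Weight image features by a per-pixel distribution over bins.

    features: (C, H, W) image feature map
    logits:   (B, H, W) unnormalized scores over B depth or height bins
    returns:  (C, B, H, W) lifted feature volume
    """
    prob = softmax(logits, axis=0)  # per-pixel distribution over bins
    return features[:, None, :, :] * prob[None, :, :, :]

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
C, H, W = 4, 3, 5
D_BINS, H_BINS = 6, 2

feat = rng.standard_normal((C, H, W))
depth_logits = rng.standard_normal((D_BINS, H, W))   # stand-in for a depth head
height_logits = rng.standard_normal((H_BINS, H, W))  # stand-in for a height head

depth_volume = lift(feat, depth_logits)    # (C, D_BINS, H, W)
height_volume = lift(feat, height_logits)  # (C, H_BINS, H, W)
```

Because each per-pixel distribution sums to one, summing either volume over its bin axis recovers the original feature map. The two volumes here share the same image features and differ only in how those features are spread along the depth or height axis.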
