Roadside monocular 3D detection requires detecting objects of the classes of interest in a 2D RGB frame and predicting their 3D information, such as their locations in bird's-eye view (BEV). It has broad applications in traffic control, vehicle-to-vehicle communication, and vehicle-infrastructure cooperative perception. To approach this problem, we present a novel and simple method that prompts the 3D detector with 2D detections. Our method builds on a key insight: compared with a 3D detector, a 2D detector is much easier to train and performs significantly better at detection on the 2D image plane. One can therefore use the detections of a well-trained 2D detector as prompts to a 3D detector, which is trained to inflate these 2D detections to 3D. To construct better prompts from the 2D detector, we explore three techniques: (a) concatenating the 2D and 3D detectors' features, (b) attentively fusing the 2D and 3D detectors' features, and (c) encoding each predicted 2D box's x, y, width, height, and label, and attentively fusing this encoding with the 3D detector's features. Surprisingly, the third performs best. Moreover, we present a yaw tuning tactic and a class-grouping strategy that merges classes based on their functionality; these techniques further improve 3D detection performance. Comprehensive ablation studies and extensive experiments demonstrate that our method resoundingly outperforms prior works, achieving state-of-the-art performance on two large-scale roadside 3D detection benchmarks.
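To make technique (c) concrete, the following is a minimal sketch of encoding 2D boxes as prompts and fusing them with 3D detector features through attention. The abstract does not specify the encoder or the attention design, so everything here is an assumption made for illustration: the linear projection in `encode_box`, the single-head dot-product attention with a residual connection in `fuse`, the feature dimension, and the random weights that stand in for learned parameters.

```python
import math
import random


def encode_box(box, num_classes, dim, weights):
    """Project a normalized (x, y, w, h) box plus a one-hot label to `dim` values.

    Hypothetical encoder: a single linear layer without bias.
    """
    x, y, w, h, label = box
    one_hot = [1.0 if c == label else 0.0 for c in range(num_classes)]
    raw = [x, y, w, h] + one_hot
    return [sum(r * wt for r, wt in zip(raw, row)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features_3d, prompts):
    """Single-head cross-attention: each 3D feature attends over all box prompts.

    The attended prompt context is added back to the 3D feature (residual).
    """
    dim = len(prompts[0])
    fused = []
    for q in features_3d:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in prompts]
        attn = softmax(scores)
        context = [sum(a * p[i] for a, p in zip(attn, prompts)) for i in range(dim)]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Toy setup: 3 classes, 8-dimensional features, random stand-in weights.
rng = random.Random(0)
NUM_CLASSES, DIM = 3, 8
weights = [[rng.uniform(-0.5, 0.5) for _ in range(4 + NUM_CLASSES)] for _ in range(DIM)]

# Two hypothetical 2D detections: (x, y, width, height, label), coordinates in [0, 1].
boxes_2d = [(0.25, 0.60, 0.10, 0.08, 0), (0.70, 0.55, 0.05, 0.12, 2)]
prompts = [encode_box(b, NUM_CLASSES, DIM, weights) for b in boxes_2d]

# Four hypothetical 3D detector feature vectors (e.g. flattened spatial locations).
features_3d = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]

fused = fuse(features_3d, prompts)
print(len(fused), len(fused[0]))
```

In this sketch the 3D features act as queries and the encoded boxes as keys and values, so the fused output keeps the shape of the 3D features while carrying information about where the 2D detector found objects and what class it assigned to them.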