Monocular 3D object detection offers a low-cost alternative to LiDAR, but it remains less accurate because metric depth is hard to estimate from a single image. We systematically evaluate how the depth backbone and feature engineering affect a monocular Pseudo-LiDAR pipeline on the KITTI validation split. Specifically, we compare NeWCRFs (supervised metric depth) against Depth Anything V2 Metric-Outdoor (Base) under an identical pseudo-LiDAR generation and PointRCNN detection protocol. NeWCRFs yields stronger downstream 3D detection, reaching 10.50\% AP$_{3D}$ at IoU$=0.7$ on the Moderate difficulty level with grayscale intensity (Exp~2). We further test point-cloud augmentations based on appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence). Contrary to the expectation that semantics would substantially close the gap, these features provide only marginal gains, and mask-based sampling can degrade performance by removing contextual geometry. Finally, we report a diagnostic of depth accuracy as a function of distance, computed with ground-truth 2D boxes (including the Pedestrian and Cyclist classes), which shows that coarse depth correctness does not fully predict performance under strict 3D IoU. Overall, with an off-the-shelf LiDAR detector, depth-backbone choice and geometric fidelity dominate performance and outweigh secondary feature injection.