A key problem in 3D object detection is reducing the accuracy gap between methods based on LiDAR sensors and those based on monocular cameras. A recently proposed Pseudo-Stereo framework for monocular 3D detection has received considerable attention in the community. However, existing practice reveals three problems: (1) the monocular depth estimator and the Pseudo-Stereo detector must be trained separately, (2) the framework is difficult to make compatible with different stereo detectors, and (3) the overall computational cost is high, which slows inference. In this work, we propose an end-to-end, efficient Pseudo-Stereo 3D detection framework by introducing a Single-View Diffusion Model (SVDM) that uses a few iterations to gradually deliver informative right-view pixels to the left image. SVDM allows the entire Pseudo-Stereo 3D detection pipeline to be trained end-to-end and can benefit from the training of stereo detectors. We further explore the application of SVDM to depth-free stereo 3D detection, and the final framework is compatible with most stereo detectors. Across multiple benchmarks on the KITTI dataset, we achieve new state-of-the-art performance.