Perceiving 3D objects from monocular inputs is crucial for robotic systems because it is more economical than multi-sensor setups. It is notably difficult because a single image cannot provide any cues for predicting absolute depth values. Motivated by binocular methods for 3D object detection, we take advantage of the strong geometric structure provided by camera ego-motion for accurate object depth estimation and detection. We first analyze this general two-view case theoretically and identify two challenges: 1) cumulative errors from multiple estimations make direct prediction intractable; 2) static cameras and matching ambiguity cause inherent dilemmas. Accordingly, we establish stereo correspondence with a geometry-aware cost volume as the alternative for depth estimation and further compensate it with monocular understanding to address the second problem. Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features into 3D space and detects 3D objects there. We also present a pose-free DfM to make it usable when the camera pose is unavailable. Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark. Detailed quantitative and qualitative analyses also validate our theoretical conclusions. The code will be released at https://github.com/Tai-Wang/Depth-from-Motion.
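To make the two-view idea concrete, the following is a minimal sketch of scoring depth hypotheses for one pixel under known ego-motion. It is not the DfM implementation: the function names, the intrinsics `K`, the 1 m forward motion, and the use of reprojection distance to a known match (as a stand-in for the feature similarity a learned cost volume would use) are all assumptions made for illustration. It also illustrates the static-camera dilemma, where every hypothesis scores the same.

```python
import numpy as np

def backproject(pixel, depth, K):
    """Lift a pixel (u, v) at a hypothesized depth to a 3D point in the camera frame."""
    u, v = pixel
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

def project(point, K):
    """Project a 3D camera-frame point to pixel coordinates (pinhole model)."""
    uvw = K @ point
    return uvw[:2] / uvw[2]

def reprojection_costs(pixel_cur, pixel_prev, K, R, t, depth_candidates):
    """Cost per depth hypothesis: pixel distance between the warped point and its match.

    R, t map points from the current camera frame to the previous one.
    A learned cost volume would compare image features instead (assumption).
    """
    costs = []
    for d in depth_candidates:
        p_prev = R @ backproject(pixel_cur, d, K) + t
        costs.append(np.linalg.norm(project(p_prev, K) - pixel_prev))
    return np.array(costs)

# Hypothetical intrinsics and a static 3D point 20 m ahead of the current camera.
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
point_cur = np.array([2.0, 1.0, 20.0])
depth_candidates = np.arange(5.0, 60.0, 1.0)

# Moving camera: the previous frame was taken 1 m behind the current one.
R = np.eye(3)
t_moving = np.array([0.0, 0.0, 1.0])
pixel_cur = project(point_cur, K)
pixel_prev = project(R @ point_cur + t_moving, K)
costs = reprojection_costs(pixel_cur, pixel_prev, K, R, t_moving, depth_candidates)
best_depth = depth_candidates[np.argmin(costs)]

# Static camera: zero baseline, so the match carries no depth information.
t_static = np.zeros(3)
static_costs = reprojection_costs(pixel_cur, pixel_cur, K, R, t_static, depth_candidates)

print("best depth with ego-motion:", best_depth)
print("cost spread with static camera:", static_costs.max() - static_costs.min())
```

With ego-motion, the cost has a single minimum at the true depth; with a static camera, the cost is flat across all hypotheses, which is the kind of case where a monocular prior would be needed to resolve depth.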