Self-supervised monocular depth estimation (MDE) has gained popularity for obtaining depth predictions directly from videos. However, these methods often produce scale-invariant results (depth known only up to an unknown scale factor) unless additional training signals are provided. To address this challenge, we introduce a novel self-supervised metric-scaled MDE model that requires only monocular video data and the camera's mounting position, both of which are readily available in modern vehicles. Our approach leverages planar-parallax geometry to reconstruct scene structure. The full pipeline consists of three main networks: a multi-frame network, a single-frame network, and a pose network. The multi-frame network processes sequential frames to estimate the structure of the static scene using planar-parallax geometry and the camera mounting position. Based on this reconstruction, it acts as a teacher, distilling knowledge such as scale information, the masked drivable area, metric-scale depth for the static scene, and a dynamic-object mask to the single-frame network. It also aids the pose network in predicting a metric-scaled relative pose between two subsequent images. Our method achieves state-of-the-art results for metric-scaled depth prediction on the KITTI driving benchmark. Notably, it is one of the first methods to produce self-supervised metric-scaled depth predictions on the challenging Cityscapes dataset, demonstrating its effectiveness and versatility.
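To illustrate how a known camera mounting height can fix the metric scale in planar-parallax geometry, the following is a minimal sketch and not the implementation used in this work. It assumes a pinhole camera with the y-axis pointing down, a flat road plane with unit normal `normal` at distance `cam_height` below the camera, and a per-pixel planar-parallax value `gamma = H / Z` (height above the road divided by depth). The function names, the intrinsics, and the numbers in the usage lines are hypothetical. Substituting `H = cam_height - Z * (normal . ray)` into `gamma = H / Z` gives `Z = cam_height / (gamma + normal . ray)`.

```python
def ray_from_pixel(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a camera-frame ray with z-component 1.

    Assumes an ideal pinhole camera without lens distortion.
    """
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def metric_depth(gamma, ray, normal, cam_height):
    """Recover metric depth Z from the planar-parallax ratio gamma = H / Z.

    A 3D point is X = Z * ray. Its height above the road plane is
    H = cam_height - normal . X, so gamma * Z = cam_height - Z * (normal . ray)
    and therefore Z = cam_height / (gamma + normal . ray).
    """
    denom = gamma + sum(n * r for n, r in zip(normal, ray))
    if denom <= 0.0:
        # The ray never reaches the implied height in front of the camera
        # (for example, a road pixel at or above the horizon).
        raise ValueError("no valid depth for this gamma and ray")
    return cam_height / denom


# Hypothetical setup: camera 1.65 m above the road, y-axis pointing down.
normal = (0.0, 1.0, 0.0)
cam_height = 1.65

# A road pixel (gamma = 0) 180 rows below the principal point.
road_ray = ray_from_pixel(u=640.0, v=360.0, fx=720.0, fy=720.0, cx=640.0, cy=180.0)
road_depth = metric_depth(0.0, road_ray, normal, cam_height)  # about 6.6 m

# A point 1 m above the road at 10 m depth: gamma = 0.1, ray y = 0.65 / 10.
object_depth = metric_depth(0.1, (0.0, 0.065, 1.0), normal, cam_height)  # about 10 m
```

In this sketch the only metric quantity is `cam_height`; `gamma` and the ray are dimensionless, which is why the mounting position alone is enough to express depth in meters under the flat-road assumption.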