Monocular depth estimation is challenging because it is inherently ambiguous and ill-posed, yet it is important to many applications. Recent works design increasingly complicated networks to extract features from a single RGB image, but because a single image offers few spatial geometric cues, their accuracy remains limited. We instead introduce spatial cues by training a teacher network that takes left-right image pairs as input and transferring its learned 3D geometry-aware knowledge to a monocular student network. Specifically, we present a novel knowledge distillation framework, named ADU-Depth, in which the well-trained teacher network guides the learning of the student network, improving depth estimation accuracy with the help of the extra spatial scene information. To enable domain adaptation and ensure effective, smooth knowledge transfer from teacher to student, we apply both attention-adapted feature distillation and focal-depth-adapted response distillation during training. In addition, we explicitly model the uncertainty of depth estimation and use it to guide distillation in both feature space and result space, so that the student better produces 3D-aware knowledge from monocular observations and learns more effectively in hard-to-predict image regions. Extensive experiments on the real-world depth estimation datasets KITTI and DrivingStereo demonstrate the effectiveness of the proposed method, which ranked 1st on the challenging KITTI online benchmark.
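To make the idea of uncertainty-guided response distillation concrete, the following is a minimal sketch only. It is not the ADU-Depth formulation, which the abstract does not specify. It swaps in a generic heteroscedastic Laplace likelihood loss, in which the student predicts a per-pixel log-scale alongside its depth and the teacher's depth serves as the regression target. The function name `uncertainty_weighted_distillation`, the flat-list input layout, and the toy depth values are all assumptions made for illustration.

```python
import math


def uncertainty_weighted_distillation(student_depth, teacher_depth, log_scale):
    """Mean Laplace negative log-likelihood (up to a constant) of the
    teacher depths under the student's predictions.

    All three arguments are flat sequences of per-pixel floats of equal
    length. log_scale[i] is the student's predicted log uncertainty for
    pixel i. This is a hypothetical helper, not the paper's loss.
    """
    if not (len(student_depth) == len(teacher_depth) == len(log_scale)):
        raise ValueError("inputs must have the same length")
    if not student_depth:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for d_s, d_t, s in zip(student_depth, teacher_depth, log_scale):
        # The residual is scaled down where predicted uncertainty is high;
        # the "+ s" term penalizes predicting high uncertainty everywhere.
        total += abs(d_s - d_t) * math.exp(-s) + s
    return total / len(student_depth)


# Toy example: three pixels, depths in meters (made-up values).
teacher = [10.0, 20.0, 30.0]
student = [10.0, 22.0, 27.0]

# With log-scale 0 everywhere (scale 1) the loss reduces to a plain L1 loss.
loss_uniform = uncertainty_weighted_distillation(student, teacher, [0.0, 0.0, 0.0])

# Declaring the second pixel uncertain halves its residual but adds log(2).
loss_uncertain = uncertainty_weighted_distillation(
    student, teacher, [0.0, math.log(2.0), 0.0]
)
```

In this generic form, high predicted uncertainty attenuates the penalty on a pixel. The abstract instead describes using uncertainty to strengthen learning in hard-to-predict regions, so the actual weighting in the proposed method may well differ from this sketch.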