Self-supervised single-frame and multi-frame depth estimation methods both train on unlabeled monocular videos alone, but they leverage different information: single-frame methods rely mainly on appearance-based features, whereas multi-frame methods focus on geometric cues. Because this information is complementary, some works use single-frame depth to improve multi-frame depth. However, these methods neither exploit the difference between single-frame and multi-frame depth to improve multi-frame depth nor leverage multi-frame depth to optimize single-frame depth models. To fully utilize the mutual influence between single-frame and multi-frame methods, we propose a novel self-supervised training framework. Specifically, we first introduce a pixel-wise adaptive depth sampling module, guided by single-frame depth, to train the multi-frame model. Then, we use a minimum-reprojection-based distillation loss to transfer knowledge from the multi-frame depth network to the single-frame network, improving single-frame depth. Finally, we use the improved single-frame depth as a prior to further boost multi-frame depth estimation. Experimental results on the KITTI and Cityscapes datasets show that our method outperforms existing approaches in the self-supervised monocular setting.
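
The first two stages described above can be illustrated with a rough sketch. It is not the authors' implementation: the abstract does not specify how the sampling range is set or how the distillation mask is built, so everything below is an assumption made for illustration. In particular, the function names, the `rel_range` and `num_bins` parameters, the log-uniform spacing of candidates around the single-frame depth, the plain L1 penalty, and the rule "distill only where the multi-frame (teacher) reprojection error is lower than the single-frame (student) error" are all hypothetical. Pure Python lists stand in for tensors to keep the example dependency-free.

```python
import math


def adaptive_depth_candidates(prior_depth, num_bins=8, rel_range=0.3):
    """Depth hypotheses for ONE pixel, centred on its single-frame depth.

    Assumption: the search interval is
    [prior / (1 + rel_range), prior * (1 + rel_range)], sampled uniformly
    in log-depth. The abstract does not state the actual rule.
    """
    if prior_depth <= 0 or num_bins < 2 or rel_range <= 0:
        raise ValueError("need prior_depth > 0, num_bins >= 2, rel_range > 0")
    lo = math.log(prior_depth / (1.0 + rel_range))
    hi = math.log(prior_depth * (1.0 + rel_range))
    step = (hi - lo) / (num_bins - 1)
    return [math.exp(lo + i * step) for i in range(num_bins)]


def candidate_volume(prior_map, num_bins=8, rel_range=0.3):
    """Apply the per-pixel sampling to an H x W depth map (list of rows)."""
    return [
        [adaptive_depth_candidates(d, num_bins, rel_range) for d in row]
        for row in prior_map
    ]


def masked_distillation_loss(student, teacher, err_student, err_teacher):
    """Mean L1 gap between student and teacher depth over selected pixels.

    Assumption: a pixel is selected only when the teacher's reprojection
    error is strictly lower than the student's, so the student is never
    pulled toward a teacher depth that explains the image worse.
    """
    total, count = 0.0, 0
    for s_row, t_row, es_row, et_row in zip(student, teacher,
                                            err_student, err_teacher):
        for s, t, es, et in zip(s_row, t_row, es_row, et_row):
            if et < es:
                total += abs(s - t)
                count += 1
    # Return 0 when no pixel qualifies, to avoid dividing by zero.
    return total / count if count else 0.0


if __name__ == "__main__":
    prior = [[2.0, 10.0]]  # toy 1 x 2 single-frame depth map, in metres
    volume = candidate_volume(prior, num_bins=5, rel_range=0.5)
    print(volume[0][0])  # five hypotheses around 2 m
    print(volume[0][1])  # five hypotheses around 10 m
```

In this sketch each pixel gets its own depth range, so a near pixel and a far pixel are searched at different scales instead of sharing one global set of depth bins. With an odd `num_bins`, the middle candidate coincides with the single-frame depth itself.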