Monocular depth estimation plays a fundamental role in computer vision. Because depth ground truth is costly to acquire, self-supervised methods that use adjacent frames to establish a supervisory signal have emerged as the most promising paradigm. In this work, we propose two novel ideas to improve self-supervised monocular depth estimation: 1) self-reference distillation and 2) disparity offset refinement. Specifically, we use a parameter-optimized model as the teacher, which is updated as the training epochs progress and provides additional supervision during training. The teacher model has the same structure as the student model, with weights inherited from the historical student model. In addition, a multiview check is introduced to filter out the outliers produced by the teacher model. Furthermore, we leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets, which are used to refine the disparity output incrementally by aligning disparity information at different scales. Experimental results on the KITTI and Make3D datasets show that our method outperforms previous state-of-the-art competitors.
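The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the teacher update criterion, the form of the multiview check, or how offsets are applied, so all of the following are assumptions made for illustration. The teacher is assumed to inherit the student's weights whenever the student's validation error improves, the multiview check is assumed to keep teacher pixels whose reprojection error is no worse than the student's, and refinement is assumed to upsample the disparity by 2x and add an offset at each scale. All function and parameter names are hypothetical.

```python
import copy
import numpy as np

def update_teacher(student_params, teacher_params, student_err, best_err):
    """Assumed rule: the teacher inherits the student's weights when the
    student's validation error beats the best error seen so far."""
    if student_err < best_err:
        return copy.deepcopy(student_params), student_err
    return teacher_params, best_err

def multiview_mask(teacher_reproj_err, student_reproj_err):
    """Assumed check: keep a teacher pixel only if its reprojection error
    is no worse than the student's at that pixel."""
    return teacher_reproj_err <= student_reproj_err

def distillation_loss(student_disp, teacher_disp, mask):
    """Mean L1 difference between student and teacher disparity,
    computed over the pixels that passed the check."""
    n = mask.sum()
    if n == 0:
        return 0.0
    return float((np.abs(student_disp - teacher_disp) * mask).sum() / n)

def refine_disparity(coarse_disp, offsets):
    """Assumed refinement: at each scale (coarse to fine), upsample the
    disparity 2x by nearest neighbour and add that scale's offset."""
    disp = coarse_disp
    for offset in offsets:
        disp = np.repeat(np.repeat(disp, 2, axis=0), 2, axis=1)
        disp = disp + offset
    return disp
```

In a training loop under these assumptions, `update_teacher` would be called once per epoch, and `distillation_loss` would be added to the usual photometric loss with the mask from `multiview_mask`, so that unreliable teacher pixels contribute no supervision.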