Monocular depth estimation plays a fundamental role in computer vision. Because depth ground truth is costly to acquire, self-supervised methods that use adjacent frames to establish a supervisory signal have emerged as one of the most promising paradigms. In this work, we propose two novel ideas to improve self-supervised monocular depth estimation: 1) self-reference distillation and 2) disparity offset refinement. Specifically, we use a parameter-optimized model as the teacher, updated as the training epochs progress, to provide additional supervision during training. The teacher model has the same structure as the student model, with weights inherited from a historical version of the student. In addition, a multiview check is introduced to filter out outliers produced by the teacher model. Furthermore, we leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets, which are used to refine the disparity output incrementally by aligning disparity information across scales. Experimental results on the KITTI and Make3D datasets show that our method outperforms previous state-of-the-art competitors.
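To make the self-reference distillation idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the teacher is a plain copy of the student's weights taken at some earlier epoch, that the multiview check is a relative depth-consistency threshold against depth reprojected from an adjacent view, and that the distillation term is a masked L1 loss. The names `update_teacher`, `multiview_mask`, `distillation_loss`, and the tolerance `rel_tol` are all hypothetical; the abstract does not specify these details.

```python
import copy

import numpy as np


def update_teacher(student_params):
    """Snapshot the student's current weights to act as the teacher.

    Assumption: the teacher is a frozen copy of a historical student.
    """
    return copy.deepcopy(student_params)


def multiview_mask(teacher_depth, reprojected_depth, rel_tol=0.05):
    """Return True where the teacher depth agrees with an adjacent view.

    Assumption: agreement means the relative depth error is below rel_tol.
    """
    rel_err = np.abs(teacher_depth - reprojected_depth) / np.maximum(teacher_depth, 1e-6)
    return rel_err < rel_tol


def distillation_loss(student_depth, teacher_depth, mask):
    """Mean L1 distance to the teacher over pixels that passed the check."""
    if not mask.any():
        return 0.0
    return float(np.abs(student_depth - teacher_depth)[mask].mean())


# Toy 2x2 depth maps (values are illustrative only).
teacher_depth = np.array([[2.0, 4.0], [10.0, 8.0]])
reprojected_depth = np.array([[2.05, 4.1], [20.0, 8.0]])  # bottom-left pixel disagrees
student_depth = np.array([[2.2, 4.0], [1.0, 7.0]])

mask = multiview_mask(teacher_depth, reprojected_depth)
loss = distillation_loss(student_depth, teacher_depth, mask)

# Teacher weights stay fixed while the student keeps training.
student_params = {"w": np.array([1.0, 2.0])}
teacher_params = update_teacher(student_params)
student_params["w"] += 0.5
```

In this toy case the bottom-left pixel fails the check, so its large teacher-student disagreement does not contribute to the loss. In a full training loop this term would be added to the usual photometric self-supervision, and `update_teacher` would be called on whatever schedule the method prescribes.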