Self-supervised depth estimation from monocular sequences relies on jointly learning a depth network and a pose network. Despite abundant research on improving the depth network, efforts on the pose network remain limited. In this context, we highlight the importance of aligning the scene scales estimated by the pose and depth networks, even when depth is only estimated up to scale. We then introduce SA4Depth, an approach that improves this alignment and boosts depth predictions while keeping inference time unchanged. The proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames, and refines the pose estimates by reducing the feature alignment residuals. With our method, the scene scales estimated by the separate depth and pose networks are aligned, and the scale consistency of the predictions improves across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors, on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth .
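To make the scale-alignment argument concrete, the following is a minimal geometric sketch, not the SA4Depth implementation. It reprojects a few pixels from one frame into the next using a depth value and a relative pose, then measures the pixel residual when the depth and the translation are scaled consistently versus inconsistently. The intrinsics `K`, the pixel coordinates, the depths, the translation, and the helper names `reproject` and `scale_residual` are all hypothetical values chosen for illustration. The method described above operates on learnable feature maps and refines the pose through a differentiable procedure, which this sketch does not attempt.

```python
import numpy as np


def reproject(pixels, depth, K, R, t):
    """Warp pixels (N, 2) with per-pixel depth (N,) into the next frame.

    R (3, 3) and t (3,) map points from the source camera to the target camera.
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])        # homogeneous pixel coordinates
    rays = homog @ np.linalg.inv(K).T        # back-project to rays with z = 1
    points = rays * depth[:, None]           # 3D points in the source camera
    moved = points @ R.T + t                 # same points in the target camera
    proj = moved @ K.T                       # project with the intrinsics
    return proj[:, :2] / proj[:, 2:3]        # perspective division


# Hypothetical camera and scene, for illustration only.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[100.0, 120.0], [320.0, 240.0], [500.0, 400.0]])
depth = np.array([5.0, 10.0, 20.0])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

reference = reproject(pixels, depth, K, R, t)


def scale_residual(depth_scale, pose_scale):
    """Mean pixel error when depth and translation carry the given scales."""
    warped = reproject(pixels, depth_scale * depth, K, R, pose_scale * t)
    return float(np.mean(np.linalg.norm(warped - reference, axis=1)))


aligned = scale_residual(2.0, 2.0)      # depth and pose share one scale
misaligned = scale_residual(2.0, 1.0)   # depth scaled, pose left unchanged
print(f"aligned residual:    {aligned:.6f} px")
print(f"misaligned residual: {misaligned:.6f} px")
```

When both networks share the same scale factor, the reprojection is unchanged, which is why depth can be learned only up to scale. When the two scales differ, the warped pixels drift, and a residual of this kind is what a pose refinement step can reduce.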