Multi-frame methods improve monocular depth estimation over single-frame approaches by aggregating spatio-temporal information through feature matching. However, spatio-temporal features degrade accuracy in dynamic scenes, where moving objects violate the static-environment assumption. To improve performance, recent methods tend to adopt complex architectures for feature matching and for handling dynamic scenes. In this paper, we show that a simple learning framework, together with carefully designed feature augmentation, leads to superior performance. (1) We propose a novel, geometrically explainable method for detecting dynamic objects. The detected dynamic objects are excluded during training, which upholds the static-environment assumption and mitigates the accuracy degradation of multi-frame depth estimation. (2) We propose multi-scale feature fusion for feature matching in the multi-frame depth network, which improves matching, especially between frames with large camera motion. (3) We propose robust knowledge distillation with a robust teacher network and a reliability guarantee, which improves multi-frame depth estimation without increasing computational complexity at test time. Experiments show that the proposed methods substantially improve multi-frame depth estimation.
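As a rough illustration of contribution (1), excluding detected dynamic objects during training can be thought of as masking those pixels out of a per-pixel loss. The sketch below is only that: the function name `masked_photometric_loss`, the use of a plain L1 photometric error, and the boolean `dynamic_mask` input are assumptions made for illustration, since the abstract does not specify the loss or how the detector's output is represented.

```python
import numpy as np


def masked_photometric_loss(target, warped, dynamic_mask, eps=1e-8):
    """Mean absolute error over pixels NOT flagged as dynamic.

    Hypothetical helper, not the paper's implementation.
    target, warped: float arrays of shape (H, W) or (H, W, C).
    dynamic_mask:   array of shape (H, W); True marks a dynamic pixel.
    """
    target = np.asarray(target, dtype=np.float64)
    warped = np.asarray(warped, dtype=np.float64)
    dynamic_mask = np.asarray(dynamic_mask, dtype=bool)

    # Per-pixel absolute photometric error.
    error = np.abs(target - warped)
    if error.ndim == 3:
        error = error.mean(axis=-1)  # average over colour channels

    # Weight 1 for static pixels, 0 for dynamic ones.
    static = (~dynamic_mask).astype(np.float64)

    # Normalise by the number of static pixels; eps avoids division by zero
    # when every pixel is masked out.
    return float((error * static).sum() / (static.sum() + eps))
```

With this kind of masking, a large error on a pixel flagged as dynamic contributes nothing to the loss, so only regions consistent with the static-environment assumption drive training.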