Multi-view stereo depth estimation based on a cost volume usually outperforms self-supervised monocular depth estimation, except on moving objects and low-textured surfaces. In this paper, we propose a multi-frame depth estimation framework in which the monocular depth is refined continuously by multi-frame sequential constraints, using a Bayesian fusion layer applied over several iterations. Both the monocular and the multi-view networks can be trained without depth supervision. Combining monocular estimation with the multi-view cost volume in this way also improves interpretability. Detailed experiments show that our method surpasses state-of-the-art unsupervised methods that use single or multiple frames at test time on the KITTI benchmark.
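To make the idea of fusing a monocular estimate with a multi-view cost volume concrete, the following is a minimal per-pixel sketch of one generic Bayesian reading, not the fusion layer of this paper. It assumes the monocular network supplies a Gaussian prior over depth, that the cost volume gives a matching cost for each depth hypothesis, and that the likelihood is `exp(-beta * cost)`. The names `bayesian_fuse`, `refine`, `beta`, and `min_std` are hypothetical and chosen for illustration only.

```python
import numpy as np


def bayesian_fuse(depth_bins, prior_mean, prior_std, cost, beta=1.0):
    """Fuse a Gaussian depth prior with a cost-volume likelihood at one pixel.

    depth_bins : (K,) candidate depths (the depth hypotheses of the cost volume)
    prior_mean, prior_std : assumed Gaussian prior from the monocular estimate
    cost : (K,) matching cost per hypothesis, lower means a better match
    beta : assumed scale that turns cost into a log-likelihood
    """
    log_prior = -0.5 * ((depth_bins - prior_mean) / prior_std) ** 2
    log_lik = -beta * cost
    log_post = log_prior + log_lik
    log_post = log_post - log_post.max()  # stabilise the exponential
    post = np.exp(log_post)
    post = post / post.sum()              # normalise over the hypotheses
    mean = float((post * depth_bins).sum())
    std = float(np.sqrt((post * (depth_bins - mean) ** 2).sum()))
    return mean, std, post


def refine(depth_bins, mono_mean, mono_std, costs, beta=1.0, min_std=1e-3):
    """Iterative refinement: each posterior becomes the prior for the next frame."""
    mean, std = mono_mean, mono_std
    for cost in costs:
        # min_std keeps the prior from collapsing to zero width
        mean, std, _ = bayesian_fuse(depth_bins, mean, max(std, min_std), cost, beta)
    return mean, std


if __name__ == "__main__":
    bins = np.linspace(1.0, 20.0, 96)
    # Hypothetical matching cost with its minimum at 10 m.
    cost = 0.5 * (bins - 10.0) ** 2
    mean, std = refine(bins, mono_mean=8.0, mono_std=2.0, costs=[cost, cost])
    print(f"fused depth {mean:.2f} m, std {std:.2f} m")
```

Under these assumptions, a flat cost (as on a low-textured surface) leaves the monocular prior essentially unchanged, while a sharply peaked cost pulls the estimate toward the multi-view match and narrows its uncertainty. This matches the complementary behaviour the abstract describes.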