In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by training it to predict the future. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to iteratively predict multi-frame features one time step ahead. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net produce queries that are used by the depth decoder and by a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latency comparable to that of monocular models.
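The iterative, one-step-ahead feature prediction described above can be illustrated with a toy sketch. This is not the authors' F-Net: the abstract does not specify the architecture or the loss, so the names `linear_extrapolate` and `rollout_loss`, the use of flat lists as "features", and the mean squared error objective are all assumptions made for illustration. The sketch only shows the training signal: a predictor receives the features of a window of consecutive frames, outputs the window shifted one time step ahead, is fed its own output again, and is scored against the true future features.

```python
from typing import Callable, List, Sequence

Feature = List[float]   # toy stand-in for one frame's feature map (assumption)
Window = List[Feature]  # features of several consecutive frames


def linear_extrapolate(window: Window) -> Window:
    """Hypothetical stand-in for a learned predictor.

    Shifts the window one time step ahead; the new last frame is
    extrapolated linearly from the two most recent frames.
    """
    last, prev = window[-1], window[-2]
    nxt = [2.0 * a - b for a, b in zip(last, prev)]
    return window[1:] + [nxt]


def rollout_loss(
    frames: Sequence[Feature],
    window_size: int,
    steps: int,
    predictor: Callable[[Window], Window],
) -> float:
    """Mean squared error of an iterative one-step-ahead rollout.

    At step k the predictor's own previous output is fed back in, and the
    result is compared with the true features of frames k .. k+window_size-1.
    """
    if window_size < 2 or steps < 1:
        raise ValueError("need window_size >= 2 and steps >= 1")
    if len(frames) < window_size + steps:
        raise ValueError("not enough frames for the requested rollout")

    window = [list(f) for f in frames[:window_size]]
    total, count = 0.0, 0
    for k in range(1, steps + 1):
        window = predictor(window)            # predict one step ahead
        target = frames[k:k + window_size]    # true future features
        for pred_f, true_f in zip(window, target):
            for p, t in zip(pred_f, true_f):
                total += (p - t) ** 2
                count += 1
    return total / count


# Constant-velocity "features": linear extrapolation predicts them exactly.
linear_frames = [[float(i), 2.0 * i] for i in range(5)]
linear_loss = rollout_loss(linear_frames, window_size=3, steps=2,
                           predictor=linear_extrapolate)

# Accelerating "features": the same predictor now incurs a non-zero loss.
quadratic_frames = [[float(i * i)] for i in range(4)]
quadratic_loss = rollout_loss(quadratic_frames, window_size=3, steps=1,
                              predictor=linear_extrapolate)
```

In a real system the predictor would be a trainable network and the loss would be minimized by gradient descent; the fixed extrapolation rule here merely makes the rollout and its error measurable without any learning framework.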