Video depth estimation aims to infer temporally consistent depth. One approach is to finetune a single-image model on each video with geometry constraints, but this is inefficient and lacks robustness. An alternative is to learn consistency from data, which requires well-designed models and sufficient video depth data. To address both challenges, we introduce NVDS+, which stabilizes the inconsistent depth estimated by various single-image models in a plug-and-play manner. We also build the large-scale Video Depth in the Wild (VDW) dataset, which contains 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. In addition, we design a bidirectional inference strategy that improves consistency by adaptively fusing forward and backward predictions. We instantiate a family of models, ranging from small to large, for different applications. The method is evaluated on the VDW dataset and three public benchmarks. To further demonstrate its versatility, we extend NVDS+ to video semantic segmentation and several downstream applications, including bokeh rendering, novel view synthesis, and 3D reconstruction. Experimental results show that our method achieves significant improvements in consistency, accuracy, and efficiency. Our work serves as a solid baseline and data foundation for learning-based video depth estimation. Code and dataset are available at: https://github.com/RaymondWang987/NVDS