Depth estimation is an important step in many computer vision problems, such as 3D reconstruction, novel view synthesis, and computational photography. Most existing work focuses on depth estimation from single frames. When applied to videos, the results lack temporal consistency and show flickering and swimming artifacts. In this paper, we aim to estimate temporally consistent depth maps of video streams in an online setting. This is a difficult problem because future frames are not available and the method must choose between enforcing consistency and correcting errors in previous estimates. The presence of dynamic objects further complicates the problem. We propose to address these challenges by using a global point cloud that is dynamically updated at each frame, along with a learned fusion approach in image space. Our approach encourages consistency while simultaneously allowing updates to handle errors and dynamic objects. Qualitative and quantitative results show that our method achieves state-of-the-art quality for consistent video depth estimation.