Recent developments in monocular depth estimation enable high-quality depth estimation for single-view images but fail to produce consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is expensive to train and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps across different timesteps. First, we fine-tune the DUSt3R model on dynamic scenes with additional estimated monocular depth maps as inputs. Then, we apply an optimization to jointly reconstruct the depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with superior performance to baseline methods.