Efficiently reconstructing accurate 3D models from monocular video is a key challenge in computer vision, critical for advancing applications in virtual reality, robotics, and scene understanding. Existing approaches typically require pre-computed camera parameters and frame-by-frame reconstruction pipelines, which are prone to error accumulation and incur significant computational overhead. To address these limitations, we introduce VideoLifter, a novel framework that leverages geometric priors from a learnable model to incrementally optimize a globally sparse-to-dense 3D representation directly from video sequences. VideoLifter segments the video sequence into local windows, where it matches and registers frames, constructs consistent fragments, and aligns them hierarchically to produce a unified 3D model. By tracking and propagating sparse point correspondences across frames and fragments, VideoLifter incrementally refines camera poses and 3D structure, minimizing reprojection error for improved accuracy and robustness. This approach significantly accelerates reconstruction, reducing training time by over 82% while surpassing current state-of-the-art methods in visual fidelity.
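The hierarchical fragment-alignment step described above relies on estimating a rigid transform between two fragments from their shared sparse point correspondences. The abstract does not specify the solver, but a standard choice for this sub-problem is the closed-form least-squares (Kabsch) alignment; the sketch below illustrates that building block only, with hypothetical function names (`rigid_align` is not from the paper):

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) with dst ~= R @ src + t.

    src, dst: (N, 3) arrays of corresponding sparse 3D points shared
    by two fragments. Closed-form Kabsch solution via SVD.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered correspondences.
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

def align_fragment(points_b, shared_b, shared_a):
    """Map fragment B's points into fragment A's frame using shared points."""
    R, t = rigid_align(shared_b, shared_a)
    return points_b @ R.T + t
```

In a full pipeline, this pairwise alignment would be applied hierarchically (pairs of fragments merged, then pairs of merged fragments, and so on), with the rough rigid fit subsequently refined by minimizing reprojection error; those refinement details are beyond this sketch.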