3D scene reconstruction is a long-standing vision task. Existing approaches fall into two categories: geometry-based and learning-based methods. The former leverage multi-view geometry but can fail catastrophically because they rely on accurate pixel correspondences across views. The latter were proposed to mitigate these issues by learning 2D or 3D representations directly. However, without large-scale video or 3D training data, they can hardly generalize to diverse real-world scenarios, because the deep network contains tens of millions or even billions of parameters to optimize. Recently, robust monocular depth estimation models trained on large-scale datasets have been shown to possess a weak 3D geometry prior, but they are insufficient for reconstruction because of unknown camera parameters, the affine-invariant property of their predictions, and inter-frame inconsistency. Here, we propose a novel test-time optimization approach that transfers the robustness of affine-invariant depth models such as LeReS to challenging, diverse scenes while ensuring inter-frame consistency, with only dozens of parameters to optimize per video frame. Specifically, our approach freezes the pre-trained affine-invariant depth model's depth predictions, rectifies them by optimizing the unknown scale-shift values with a geometric consistency alignment module, and uses the resulting scale-consistent depth maps to robustly obtain camera poses and achieve dense scene reconstruction, even in low-texture regions. Experiments show that our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
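To make the role of the unknown scale-shift values concrete, the following is a minimal sketch of what rectifying an affine-invariant depth map can look like in the simplest case. It is not the geometric consistency alignment module described above: it assumes a single global scale and shift per frame and assumes reference depths are available at some pixels, whereas the proposed approach optimizes its per-frame parameters through geometric consistency across frames. The function name `fit_scale_shift` and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_scale_shift(pred, ref, mask=None):
    """Closed-form least-squares fit of s, t such that s * pred + t ~= ref.

    pred: affine-invariant depth (correct only up to scale and shift).
    ref:  reference depth at the same pixels (an assumption of this sketch).
    mask: optional boolean array selecting the pixels used for the fit.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    ref = np.asarray(ref, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()

    # Design matrix [pred, 1]; the two unknowns are the scale and the shift.
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref[mask], rcond=None)
    return float(s), float(t)


# Synthetic example: hide a known scale (2.0) and shift (0.5) in the prediction.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=(4, 5))
affine_depth = (true_depth - 0.5) / 2.0

s, t = fit_scale_shift(affine_depth, true_depth)
rectified = s * affine_depth + t  # matches true_depth up to numerical error
```

With only two unknowns per frame the fit is a small linear least-squares problem, which is one way to see why a frozen depth model can be rectified with very few parameters; the abstract's figure of dozens of parameters per frame suggests a richer parameterization than this global one.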