Reconstructing accurate 3D scenes from images is a long-standing vision task. Because single-image reconstruction is ill-posed, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and cannot be trained on mixed data because of the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metric scale. In this work, we show that the key to a zero-shot single-view metric depth model lies in combining large-scale data training with resolving the metric ambiguity that arises from different camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity and can be plugged into existing monocular models with little effort. Equipped with our module, monocular models can be stably trained on over 8 million images captured with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures from randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular SLAM (Fig. 1), leading to high-quality metric-scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.
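The abstract names the canonical camera space transformation without spelling it out, so the following is only a minimal sketch of the underlying idea under stated assumptions. It assumes a pinhole camera and assumes that the module rescales depth by the ratio of a chosen canonical focal length to the actual focal length. The names `Camera`, `CANONICAL_FOCAL_PX`, `to_canonical_depth`, and `from_canonical_depth` are hypothetical and are not the API of the Metric3D repository.

```python
from dataclasses import dataclass

# Hypothetical canonical focal length in pixels (an assumption for illustration).
CANONICAL_FOCAL_PX = 1000.0


@dataclass(frozen=True)
class Camera:
    focal_px: float  # focal length in pixels


def projected_height_px(cam: Camera, object_height_m: float, depth_m: float) -> float:
    """Pinhole projection: image-space height of an object at a given depth."""
    return cam.focal_px * object_height_m / depth_m


def to_canonical_depth(cam: Camera, depth_m: float) -> float:
    """Map a metric depth to the depth the canonical camera would need
    in order to see the object at the same image size (assumed form)."""
    return depth_m * CANONICAL_FOCAL_PX / cam.focal_px


def from_canonical_depth(cam: Camera, canonical_depth_m: float) -> float:
    """Inverse mapping: recover metric depth for the real camera."""
    return canonical_depth_m * cam.focal_px / CANONICAL_FOCAL_PX


# Two cameras that produce the same image of a 2 m object at different depths.
wide = Camera(focal_px=500.0)
tele = Camera(focal_px=2000.0)
wide_depth_m, tele_depth_m = 5.0, 20.0

same_appearance = (
    projected_height_px(wide, 2.0, wide_depth_m),
    projected_height_px(tele, 2.0, tele_depth_m),
)
canonical_depths = (
    to_canonical_depth(wide, wide_depth_m),
    to_canonical_depth(tele, tele_depth_m),
)
```

In this toy setup the two views look identical although their metric depths differ by a factor of four, which is the metric ambiguity the abstract refers to. After the assumed rescaling both views share one canonical depth, so a single training target is consistent across cameras, and `from_canonical_depth` restores the metric value once the real focal length is known.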