Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image

Source: arXiv. Accepted to ICCV 2023. Winner of the 2nd Monocular Depth Estimation Challenge. The code is available at https://github.com/YvanYin/Metric3D

Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular-SLAM (Fig. 1), leading to high-quality metric scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.
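The abstract names the metric ambiguity and a canonical camera space transformation without giving formulas. The following is only a minimal sketch of the idea, under a stated assumption: that the transformation rescales depth by the ratio of a fixed canonical focal length to the actual focal length. The constant `CANONICAL_FOCAL_PX` and all function names are hypothetical illustrations, not taken from the Metric3D code base.

```python
# Hypothetical canonical focal length in pixels (an assumption for illustration).
CANONICAL_FOCAL_PX = 1000.0


def projected_size_px(focal_px, object_size_m, depth_m):
    """Pinhole model: image-plane size (pixels) of an object at a given depth."""
    return focal_px * object_size_m / depth_m


def to_canonical_depth(depth_m, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Map metric depths from a real camera into the assumed canonical camera."""
    scale = canonical_focal_px / focal_px
    return [[d * scale for d in row] for row in depth_m]


def from_canonical_depth(canon_depth, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Invert the mapping: recover metric depths for the real camera."""
    scale = focal_px / canonical_focal_px
    return [[d * scale for d in row] for row in canon_depth]


# The ambiguity: a 2 m object at 5 m seen with f=500 px looks identical to
# the same object at 10 m seen with f=1000 px.
size_a = projected_size_px(500.0, 2.0, 5.0)
size_b = projected_size_px(1000.0, 2.0, 10.0)

# Round trip through the canonical space for a tiny 1x2 depth map.
depth = [[2.0, 4.0]]
canonical = to_canonical_depth(depth, focal_px=500.0)
recovered = from_canonical_depth(canonical, focal_px=500.0)
```

Under this assumption, images from cameras with different focal lengths share one consistent depth scale during training, and a prediction is mapped back to metric units using the focal length of the test camera.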

