We address the challenge of recovering 3D scene structure from monocular depth estimation. Traditional depth estimation methods train on labeled datasets to predict absolute depth directly, whereas recent work advocates mixed-dataset training, which improves generalization across diverse scenes. However, mixed-dataset training yields depth predictions that are correct only up to an unknown scale and shift, which prevents accurate 3D reconstruction. Existing solutions require extra 3D datasets or geometry-complete depth annotations, which limits their versatility. We propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations. To produce realistic 3D structures, we render novel views of the reconstructed scenes and design loss functions that promote consistent depth estimation across views. Comprehensive experiments show that our framework generalizes better than existing state-of-the-art methods on several benchmark datasets without using extra training information. Moreover, our loss functions enable the model to recover domain-specific scale and shift coefficients from unlabeled images alone.
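To make the scale-and-shift ambiguity concrete, the following minimal sketch illustrates why an unknown depth shift, unlike an unknown scale, distorts the reconstructed geometry. It is not the proposed framework, which recovers the coefficients from unlabeled images. The pinhole intrinsics (`focal`, `cx`), the toy wall, and the helper names `fit_scale_shift`, `unproject`, `collinearity_residual`, and `residual_for` are all assumptions introduced for illustration. The least-squares fit at the end uses ground-truth depth, which is exactly the supervision the framework is designed to avoid.

```python
def fit_scale_shift(pred, target):
    """Closed-form least-squares fit of target = scale * pred + shift."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


def unproject(u, depth, focal=500.0, cx=320.0):
    """Back-project pixel column u at the given depth to (x, z), pinhole model."""
    return ((u - cx) * depth / focal, depth)


def collinearity_residual(p0, p1, p2):
    """2D cross product of (p1 - p0) and (p2 - p0); zero means collinear."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p1[1] - p0[1]) * (p2[0] - p0[0])


# Three pixel columns viewing a flat wall z = 0.5 * x + 2 (hypothetical scene).
columns = [120.0, 320.0, 520.0]
true_depth = [2.0 / (1.0 - 0.5 * (u - 320.0) / 500.0) for u in columns]


def residual_for(depths):
    """Unproject the three columns and measure how far they are from a line."""
    pts = [unproject(u, d) for u, d in zip(columns, depths)]
    return collinearity_residual(*pts)


flat = residual_for(true_depth)                        # ~0: the wall is flat
scaled = residual_for([2.0 * d for d in true_depth])   # ~0: scaling keeps it flat
shifted = residual_for([d + 1.0 for d in true_depth])  # nonzero: a shift bends it

# With ground truth available, scale and shift follow from least squares.
affine_pred = [(d - 1.0) / 2.0 for d in true_depth]
scale, shift = fit_scale_shift(affine_pred, true_depth)  # close to (2.0, 1.0)
```

In this toy setting, a global scale only resizes the point cloud, while a shift turns the flat wall into a curved surface. This is why shift recovery matters for geometry-preserving depth.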