This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Rather than pursuing novel technical modules, we aim to build a simple yet powerful foundation model that handles any image under any circumstances. To this end, we scale up the dataset by designing a data engine that collects and automatically annotates large-scale unlabeled data (~62M), which significantly enlarges data coverage and thereby reduces generalization error. We investigate two simple yet effective strategies that make this data scaling promising. First, we create a more challenging optimization target by leveraging data augmentation tools, which compels the model to actively seek extra visual knowledge and acquire robust representations. Second, we develop an auxiliary supervision that makes the model inherit rich semantic priors from pre-trained encoders. We extensively evaluate the model's zero-shot capabilities on six public datasets and on randomly captured photos, where it demonstrates impressive generalization ability. Further, fine-tuning it with metric depth information from NYUv2 and KITTI sets new state-of-the-art (SOTA) results. Our better depth model also yields a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
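The two strategies named in the abstract can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration, not the released implementation: the abstract does not specify the augmentations, the loss functions, or how the automatic annotations are produced. The sketch assumes that a teacher model pseudo-labels the clean image, that the student is trained on a strongly perturbed copy, and that the auxiliary supervision is a cosine-similarity alignment between student features and features from a frozen pre-trained encoder. The names `strong_perturb`, `unlabeled_loss`, and `feature_alignment_loss` are hypothetical.

```python
import numpy as np


def strong_perturb(image, rng, noise_std=0.1):
    """Hypothetical strong augmentation: per-channel gain jitter plus Gaussian noise.

    `image` is an (H, W, C) float array with values in [0, 1].
    """
    gain = rng.uniform(0.6, 1.4, size=(1, 1, image.shape[-1]))
    noisy = image * gain + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def unlabeled_loss(student_predict, teacher_predict, image, rng):
    """Harder optimization target on an unlabeled image (assumed setup).

    The teacher annotates the clean image; the student has to reproduce that
    pseudo label from the perturbed image. A plain mean absolute error stands
    in for whatever depth loss is actually used.
    """
    pseudo_depth = teacher_predict(image)
    student_depth = student_predict(strong_perturb(image, rng))
    return float(np.mean(np.abs(student_depth - pseudo_depth)))


def feature_alignment_loss(student_feats, frozen_feats, eps=1e-8):
    """Auxiliary supervision (assumed form): 1 - mean cosine similarity.

    Both inputs are (N, C) arrays of per-location feature vectors; the frozen
    features would come from a pre-trained encoder that is not updated.
    """
    s = student_feats / (np.linalg.norm(student_feats, axis=-1, keepdims=True) + eps)
    f = frozen_feats / (np.linalg.norm(frozen_feats, axis=-1, keepdims=True) + eps)
    cosine = np.sum(s * f, axis=-1)
    return float(1.0 - cosine.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.uniform(0.0, 1.0, size=(8, 8, 3))

    # Toy stand-ins for depth networks: mean intensity as "depth".
    def toy_model(img):
        return img.mean(axis=-1)

    feats = rng.normal(size=(16, 4))
    print("unlabeled loss:", unlabeled_loss(toy_model, toy_model, image, rng))
    print("alignment loss (identical features):", feature_alignment_loss(feats, feats))
```

In a training loop, these two terms would typically be added to the supervised loss on labeled data; the weighting between them is another choice the abstract leaves open.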