This work presents Depth Anything V2. Rather than pursuing fancy techniques, we aim to reveal crucial findings that pave the way towards a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B parameters) to support a wide range of scenarios. Building on their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.
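The data flow behind the three practices can be illustrated with a deliberately tiny sketch. It is not the training code of this work: a 1-D least-squares line fit stands in for a depth network, and `true_depth`, `fit_line`, and `predict` are hypothetical names introduced only for illustration. Practice 2 (a larger teacher) has no analogue in such a toy and is omitted. The sketch only shows that the teacher sees precisely labeled synthetic data, while the student sees real inputs with teacher-generated pseudo-labels.

```python
import random


def true_depth(x):
    # Hypothetical ground-truth mapping; a stand-in for exact synthetic depth.
    return 2.0 * x + 1.0


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; stands in for model training."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def predict(model, xs):
    """Apply a fitted (a, b) model to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]


rng = random.Random(0)

# Practice 1: the teacher is trained on synthetic data only, with precise labels.
synthetic_x = [rng.uniform(0.0, 1.0) for _ in range(200)]
synthetic_y = [true_depth(x) for x in synthetic_x]
teacher = fit_line(synthetic_x, synthetic_y)

# Practice 3a: the teacher pseudo-labels a larger pool of unlabeled "real" inputs.
real_x = [rng.uniform(0.0, 5.0) for _ in range(1000)]
pseudo_y = predict(teacher, real_x)

# Practice 3b: the student is trained only on the pseudo-labeled real inputs.
student = fit_line(real_x, pseudo_y)

print("teacher (a, b):", teacher)
print("student (a, b):", student)
```

In this noise-free toy the student recovers the teacher's parameters almost exactly. The point is the direction of supervision (synthetic labels to teacher, teacher to pseudo-labels, pseudo-labels to student), not the numbers.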