Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce MetricAnything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in metric depth estimation. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of MetricAnything as a visual encoder significantly boosts the spatial-intelligence capabilities of Multimodal Large Language Models. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
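To make the Sparse Metric Prompt concrete, the following is a minimal sketch of what "randomly masking depth maps" could look like: a small random fraction of valid metric depth values is kept as the prompt, and everything else is zeroed out. The function name, the `keep_ratio` parameter, and the uniform per-pixel masking are illustrative assumptions; the abstract does not specify the paper's exact masking schedule.

```python
import numpy as np

def sparse_metric_prompt(depth, keep_ratio=0.01, seed=0):
    """Illustrative sketch (not the paper's exact procedure): keep a small
    random fraction of valid depth pixels as a sparse metric prompt.

    depth: (H, W) metric depth map; zeros/negatives are treated as invalid.
    Returns the sparse prompt (same shape, unkept pixels set to 0) and the
    boolean mask of kept pixels.
    """
    rng = np.random.default_rng(seed)
    valid = depth > 0                           # sensor-valid pixels only
    keep = rng.random(depth.shape) < keep_ratio # uniform random selection
    mask = valid & keep
    prompt = np.where(mask, depth, 0.0)
    return prompt, mask

# Example: a synthetic dense depth map masked down to ~1% of its pixels.
depth = np.random.default_rng(1).uniform(0.5, 10.0, (480, 640)).astype(np.float32)
prompt, mask = sparse_metric_prompt(depth, keep_ratio=0.01)
```

Because the kept values retain their absolute metric scale, such a prompt can in principle stand in for any sparse 3D source (LiDAR, SfM points, radar returns projected to the image), which is what lets one interface cover depth completion, super-resolution, and radar-camera fusion.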