While recent depth foundation models exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types, particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras, remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed so that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images yet generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its core components include a pitch-aware Image-to-ERP conversion with efficient online augmentation to simulate distorted ERP patches from undistorted inputs, FoV alignment operations to enable effective training across a wide range of FoVs, and multi-resolution data augmentation to further address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving $\delta_1$ accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
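To make the Image-to-ERP conversion described above concrete, the following is a minimal NumPy sketch of resampling a perspective (pinhole) image into an equi-rectangular patch, with a pitch rotation applied to the rays. The function name, nearest-neighbor sampling, and angle conventions are illustrative assumptions, not the paper's implementation (which uses efficient online augmentation during training).

```python
import numpy as np

def erp_patch_from_perspective(img, fov_deg, erp_h, erp_w, pitch_rad=0.0):
    """Resample a pinhole image into an ERP patch (hypothetical sketch).

    Each ERP pixel (longitude, latitude) is mapped to a 3D ray, rotated
    by the camera pitch, and projected back into the perspective image
    with nearest-neighbor sampling.
    """
    h, w = img.shape[:2]
    f = (w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)  # focal length from FoV

    # ERP pixel grid -> (longitude, latitude), spanning the source FoV
    half = np.radians(fov_deg) / 2.0
    lon = np.linspace(-half, half, erp_w)
    lat = np.linspace(half, -half, erp_h)
    lon, lat = np.meshgrid(lon, lat)

    # Spherical angles -> unit ray (x right, y up, z forward)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Pitch-aware: rotate rays about the x-axis by the camera pitch
    c, s = np.cos(pitch_rad), np.sin(pitch_rad)
    y, z = c * y - s * z, s * y + c * z

    # Pinhole projection back into the source image plane
    u = f * x / z + w / 2.0
    v = -f * y / z + h / 2.0

    # Nearest-neighbor lookup; rays outside the source image stay zero
    out = np.zeros((erp_h, erp_w) + img.shape[2:], dtype=img.dtype)
    ui, vi = np.round(u).astype(int), np.round(v).astype(int)
    valid = (z > 0) & (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    out[valid] = img[vi[valid], ui[valid]]
    return out
```

A nonzero `pitch_rad` shifts content vertically along the sphere, producing the latitude-dependent stretching that ERP patches exhibit away from the equator, which is the distortion the online augmentation simulates from undistorted inputs.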