Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model toward accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR depth at multiple scales within the depth decoder. To address the training challenges posed by the scarcity of datasets containing both LiDAR depth and precise ground-truth depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo ground-truth depth generation for real data. To further extend our method to arbitrary prompt depth points, we propose a new prompting mechanism that serializes the input depth points into tokens and uses self-attention to enhance the image tokens from depth foundation models. Our approach sets a new state of the art on 8 zero-shot depth benchmarks and benefits downstream applications, including 3D reconstruction and generalized robotic grasping. The code is available at https://github.com/DepthAnything/PromptDA.
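The prompting mechanism described above — serializing sparse depth points into tokens and letting self-attention propagate their metric information into the image tokens — can be sketched minimally as follows. This is an illustrative NumPy sketch, not the authors' implementation: all function names, the random linear projections, and the single-head attention layer are assumptions made for clarity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def serialize_depth_points(points, dim, rng):
    """Project sparse (x, y, depth) prompt points to dim-d tokens.

    points: (N, 3) array of pixel coordinates and metric depth.
    The random projection stands in for a learned embedding (assumption).
    """
    W = rng.standard_normal((3, dim)) / np.sqrt(3)
    return points @ W  # (N, dim)


def fuse_with_self_attention(image_tokens, depth_tokens, rng):
    """Enhance image tokens with depth-prompt tokens via one
    single-head self-attention layer over the concatenated sequence.

    Weights are random placeholders for learned parameters (assumption).
    """
    m, dim = image_tokens.shape
    tokens = np.concatenate([image_tokens, depth_tokens], axis=0)
    Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(dim))
    out = attn @ v
    # Residual connection: return only the (enhanced) image tokens.
    return image_tokens + out[:m]


# Usage: 16 image tokens of width 32, prompted by 5 sparse depth points.
rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((16, 32))
points = rng.uniform(0.0, 1.0, size=(5, 3))
depth_tokens = serialize_depth_points(points, 32, rng)
enhanced = fuse_with_self_attention(image_tokens, depth_tokens, rng)
print(enhanced.shape)  # (16, 32): same shape as the input image tokens
```

The key design point this sketch illustrates is that the depth prompt enters as extra tokens in the attention sequence rather than as a dense input channel, so any number of prompt points can be handled without changing the architecture.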