Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model toward accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates LiDAR depth at multiple scales within the depth decoder. To address the training challenge posed by the scarcity of datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo-GT depth generation for real data. Our approach sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.
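To make the two mechanisms above concrete, the following PyTorch sketch shows one possible form of multi-scale prompt fusion (resizing the LiDAR depth to each decoder scale and adding a learned projection to the features) together with a toy LiDAR simulation from synthetic GT depth. This is a minimal sketch under assumptions: the names `PromptFusionBlock` and `simulate_lidar`, the DPT-style feature shapes, the channel widths, and the Gaussian noise model are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusionBlock(nn.Module):
    """Injects low-resolution LiDAR depth into one decoder scale (hypothetical design)."""
    def __init__(self, feat_channels: int):
        super().__init__()
        # Small conv stack mapping 1-channel LiDAR depth to the decoder's
        # feature width; the width and depth of this stack are assumptions.
        self.proj = nn.Sequential(
            nn.Conv2d(1, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, feat: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        # Resize the low-res LiDAR depth to this scale's feature resolution
        # and add its projection to the features as a residual prompt signal.
        lidar_resized = F.interpolate(lidar, size=feat.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return feat + self.proj(lidar_resized)

def simulate_lidar(gt_depth: torch.Tensor, out_hw=(240, 320),
                   noise_std: float = 0.02) -> torch.Tensor:
    """Toy LiDAR simulation from synthetic GT depth: downsample to a low
    sensor resolution and perturb with depth-proportional Gaussian noise
    (an assumed noise model, for illustration only)."""
    low = F.interpolate(gt_depth, size=out_hw, mode="bilinear",
                        align_corners=False)
    return low + noise_std * low * torch.randn_like(low)

# Usage: fuse the same LiDAR prompt at every decoder scale.
fusers = nn.ModuleList(PromptFusionBlock(c) for c in [256, 256, 256, 256])
lidar = simulate_lidar(torch.rand(1, 1, 960, 1280))  # e.g. a 240x320 LiDAR map
feats = [torch.rand(1, 256, 24 * 2**i, 32 * 2**i) for i in range(4)]
fused = [f(x, lidar) for f, x in zip(fusers, feats)]
```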