Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and other non-Lambertian surfaces. Large-scale monocular depth estimation approaches can mitigate these errors by providing strong structural priors, but their predictions can be skewed or mis-scaled in metric units, which limits their direct use in robotics. In this work, we propose a training-free depth grounding framework that anchors monocular depth priors from a depth foundation model in raw sensor depth through factor graph optimization. Our method performs a patch-wise affine alignment, locally grounding monocular predictions in metric real-world depth while preserving fine-grained geometric structure and discontinuities. To facilitate evaluation in challenging real-world conditions, we introduce a benchmark dataset with dense scene-wide ground truth depth in the presence of non-Lambertian objects. Ground truth is obtained via matte reflection spray and multi-camera fusion, removing the reliance on object-only CAD-based annotations used in prior datasets. Extensive evaluations across diverse sensors and domains demonstrate consistent improvements in depth estimation performance without any (re-)training. We make our implementation publicly available at https://anchord.cs.uni-freiburg.de.
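To illustrate what a patch-wise affine alignment involves, the following is a minimal sketch under simplifying assumptions, not the factor graph optimization described above. Each patch is fit independently by ordinary least squares so that `a * mono + b` matches the sensor depth on valid pixels, with no coupling between neighboring patches. The function name `patchwise_affine_align`, the parameters `patch` and `min_valid`, the global fallback fit, and the convention that invalid sensor pixels are zero or non-finite are all hypothetical choices made for this example.

```python
import numpy as np


def patchwise_affine_align(mono, sensor, patch=32, min_valid=16):
    """Fit sensor ~= a * mono + b independently per patch (illustrative only).

    Assumptions (not from the paper): invalid sensor pixels are <= 0 or
    non-finite; patches with too few valid pixels reuse a global fit.
    """
    mono = np.asarray(mono, dtype=np.float64)
    sensor = np.asarray(sensor, dtype=np.float64)
    valid = np.isfinite(sensor) & (sensor > 0)
    if valid.sum() < 2:
        raise ValueError("need at least two valid sensor pixels")

    # Global scale/shift, used as a fallback for poorly observed patches.
    A = np.stack([mono[valid], np.ones(valid.sum())], axis=1)
    (a_g, b_g), *_ = np.linalg.lstsq(A, sensor[valid], rcond=None)

    out = np.empty_like(mono)
    height, width = mono.shape
    for y in range(0, height, patch):
        for x in range(0, width, patch):
            sl = (slice(y, y + patch), slice(x, x + patch))
            m, s, v = mono[sl], sensor[sl], valid[sl]
            a, b = a_g, b_g
            # Fit locally only if the patch is well observed and the
            # monocular values vary enough to separate scale from shift.
            if v.sum() >= min_valid and (m[v].max() - m[v].min()) > 1e-9:
                A = np.stack([m[v], np.ones(v.sum())], axis=1)
                (a, b), *_ = np.linalg.lstsq(A, s[v], rcond=None)
            # Apply the affine map to every pixel in the patch, including
            # those where the sensor reading was missing.
            out[sl] = a * m + b
    return out
```

Because every pixel in a patch is mapped through the same affine transform, relative structure from the monocular prediction is kept inside the patch while its scale and offset follow the sensor. Independent fits can produce seams at patch borders; jointly optimizing the per-patch parameters, as a factor graph formulation allows, is one way to address that.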