Depth sensing is an important problem for 3D vision-based robotics. Yet real-world active stereo and time-of-flight (ToF) depth cameras often produce noisy, incomplete depth maps that bottleneck robot performance. In this work, we propose D3RoMa, a learning-based depth estimation framework for stereo image pairs that predicts clean, accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing fails completely. The key to our method is that we unify depth estimation and restoration into a single image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporate a left-right consistency constraint as classifier guidance for the diffusion process. Our framework thus combines recent learning-based advances with geometric constraints from traditional stereo vision. For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to complement existing tabletop datasets. The trained model can be applied directly to real-world in-the-wild scenes and achieves state-of-the-art performance on multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios.
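To make the guidance idea concrete, the sketch below isolates the left-right consistency term on a synthetic 1-D stereo scanline: a photometric loss compares the left view warped by the current disparity against the right view, and its gradient steers an annealed, Langevin-style refinement of the per-pixel disparity. This is a minimal illustration under stated assumptions, not the paper's method: in D3RoMa the same gradient would shift the posterior mean of a *learned* denoising diffusion model at each reverse step, whereas here a denoiser-free noisy gradient loop, the 1-D linear-interpolation warp, and all constants are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D stereo scanline: the right view is the left view shifted
# by a constant ground-truth disparity (all values here are illustrative).
W = 64
d_true = 3.0
xs = np.arange(W, dtype=float)
left = np.sin(0.3 * xs)
right = np.sin(0.3 * (xs - d_true))

def warp(d):
    """Warp the left scanline by per-pixel disparity d (linear interpolation).

    Returns the warped signal and the local interpolation slope. Boundary
    pixels are clipped and effectively unconstrained in this toy warp.
    """
    pos = xs - d
    i0 = np.clip(np.floor(pos).astype(int), 0, W - 1)
    i1 = np.clip(i0 + 1, 0, W - 1)
    w = pos - np.floor(pos)
    return (1.0 - w) * left[i0] + w * left[i1], left[i1] - left[i0]

def lr_residual(d):
    """Photometric left-right consistency residual: warp(left, d) - right."""
    warped, _ = warp(d)
    return warped - right

def lr_grad(d):
    """Analytic gradient of the per-pixel squared consistency loss w.r.t. d.

    d/dd [(warp - right)^2] = 2 * residual * d(warp)/dd, and for linear
    interpolation d(warp)/dd = -(left[i1] - left[i0]).
    """
    warped, slope = warp(d)
    return 2.0 * (warped - right) * (-slope)

# Annealed, noisy gradient refinement driven purely by the guidance term.
# In the paper's setting this gradient would instead be added (scaled by the
# step's variance) to the denoising model's posterior mean -- classifier
# guidance -- rather than acting alone as it does in this toy loop.
d = np.zeros(W)
T = 300
for t in range(T):
    noise = 0.01 * (1.0 - t / T) * rng.standard_normal(W)  # decaying noise
    d = d - 0.5 * lr_grad(d) + noise

print(f"initial consistency loss: {np.mean(lr_residual(np.zeros(W)) ** 2):.4f}")
print(f"final consistency loss:   {np.mean(lr_residual(d) ** 2):.4f}")
print(f"median recovered disparity: {np.median(d):.2f}  (true: {d_true})")
```

In this toy, the consistency gradient alone recovers a disparity close to the ground-truth shift for interior pixels; the point is only that the left-right constraint supplies a usable per-pixel gradient signal, which is what lets it act as guidance during diffusion sampling.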