Estimating the depth of objects from a single image is a valuable task for many vision, robotics, and graphics applications. However, current methods often fail to produce accurate depth for objects in diverse scenes. In this work, we propose a simple yet effective Background Prompting strategy that adapts the input object image with a learned background. We learn the background prompts using only small-scale synthetic object datasets. To infer object depth on a real image, we place the segmented object into the learned background prompt and run off-the-shelf depth networks. Background Prompting helps the depth networks focus on the foreground object, as they are made invariant to background variations. Moreover, Background Prompting minimizes the domain gap between synthetic and real object images, leading to better sim2real generalization than simple finetuning. Results on multiple synthetic and real datasets demonstrate consistent improvements in real object depths for a variety of existing depth networks. Code and optimized background prompts can be found at: https://mbaradad.github.io/depth_prompt.
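The inference step described in the abstract (place the segmented object into the learned background prompt, then run an off-the-shelf depth network) can be illustrated with a minimal sketch. It assumes a simple alpha composite of the object over the background; the function names `composite_on_background` and `predict_object_depth`, the `depth_net` callable, and the toy arrays are all hypothetical and are not taken from the released code. The stand-in "network" below is a channel mean used only to make the example runnable, not a real depth model.

```python
import numpy as np


def composite_on_background(image, mask, background):
    """Paste the segmented object onto a background prompt.

    image:      (H, W, 3) float array, the real input image
    mask:       (H, W) float array in [0, 1], 1 where the object is
    background: (H, W, 3) float array, the learned background prompt
    """
    alpha = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return alpha * image + (1.0 - alpha) * background


def predict_object_depth(image, mask, background, depth_net):
    """Run a frozen depth network on the composited image.

    depth_net is a hypothetical callable mapping (H, W, 3) -> (H, W).
    Depth is kept only inside the object mask; other pixels become NaN.
    """
    prompted = composite_on_background(image, mask, background)
    depth = depth_net(prompted)
    return np.where(mask > 0.5, depth, np.nan)


# Toy usage: random "image", all-zero "background prompt", square object mask.
rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
background = np.zeros((4, 4, 3))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

# Stand-in for a depth network (assumption): mean over colour channels.
depth = predict_object_depth(image, mask, background, lambda x: x.mean(axis=-1))
```

In this sketch the depth network stays untouched and only the background pixels change, which is the property the abstract relies on when it says off-the-shelf networks are used at inference time.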