Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting transparent objects and estimating their depth remain challenging due to their complex optical properties. Existing methods typically address only one of these tasks and rely on extra inputs or specialized sensors, neglecting the valuable interactions between the tasks and the subsequent refinement process, which leads to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, the first to excel at both segmentation and depth estimation of transparent objects from a single RGB image. Specifically, we devise a novel semantic and geometric fusion module that effectively integrates multi-scale information between the two tasks. In addition, drawing inspiration from how humans perceive objects, we incorporate an iterative strategy that progressively refines the initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input. Code and models are publicly available at https://github.com/L-J-Yuan/MODEST.
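The abstract describes the two key components only at a high level: a semantic-geometric fusion module that exchanges multi-scale information between the segmentation and depth branches, and an iterative strategy that progressively refines the initial features. The sketch below is a minimal, illustrative PyTorch rendering of these two ideas, not the authors' released implementation; the module names, channel sizes, and the simple convolutional fusion are assumptions made for clarity.

```python
# Minimal sketch (not the released MODEST code) of a bidirectional
# semantic-geometric fusion block and an iterative refinement loop over
# two task heads. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Fuses one scale of semantic and geometric features in both directions."""

    def __init__(self, channels: int):
        super().__init__()
        self.sem_from_geo = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.geo_from_sem = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_sem, f_geo):
        # Each branch is updated with a residual computed from the concatenation
        # of both branches, so task information flows in both directions.
        both = torch.cat([f_sem, f_geo], dim=1)
        fused_sem = f_sem + F.relu(self.sem_from_geo(both))
        fused_geo = f_geo + F.relu(self.geo_from_sem(both))
        return fused_sem, fused_geo


class IterativeDualHead(nn.Module):
    """Segmentation and depth heads refined over several fusion iterations."""

    def __init__(self, channels: int = 64, num_iters: int = 3):
        super().__init__()
        self.num_iters = num_iters
        self.fusion = FusionBlock(channels)
        self.seg_head = nn.Conv2d(channels, 1, kernel_size=1)    # transparent-object mask logits
        self.depth_head = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel depth

    def forward(self, f_sem, f_geo):
        preds = []
        for _ in range(self.num_iters):
            # Progressive refinement: the fused features of one iteration
            # become the inputs of the next.
            f_sem, f_geo = self.fusion(f_sem, f_geo)
            preds.append((self.seg_head(f_sem), torch.sigmoid(self.depth_head(f_geo))))
        return preds  # intermediate predictions could be deeply supervised


if __name__ == "__main__":
    # Toy forward pass on features that a shared single-RGB encoder would produce.
    f_sem = torch.randn(1, 64, 120, 160)
    f_geo = torch.randn(1, 64, 120, 160)
    mask_logits, depth = IterativeDualHead()(f_sem, f_geo)[-1]
    print(mask_logits.shape, depth.shape)  # both torch.Size([1, 1, 120, 160])
```

In this reading, iterating the same fusion block mirrors the "progressively refines initial features" idea, while the residual cross-branch updates stand in for the paper's multi-scale semantic-geometric fusion; the actual architecture should be taken from the linked repository.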