Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, so that objects can be grasped and placed stably and reliably without accidental collisions with other objects. In general, we must construct such a 3D interpretation of a complex scene from limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. It provides a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, and pose-estimation) to obtain high-accuracy results. We demonstrate its accuracy and effectiveness against ground-truth models on a large benchmark dataset, and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand.
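The pipeline composition described above can be sketched as a chain of pretrained modules, each consuming the previous stage's output. This is a minimal illustrative sketch, not the authors' implementation: all function names and data shapes here are hypothetical placeholders standing in for the vision-language, segmentation, inpainting, image-to-3D, and pose-estimation models.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ObjectModel:
    label: str   # object name from the vision-language stage
    mesh: str    # placeholder for a complete 3D mesh
    pose: tuple  # placeholder for a 6-DoF pose in the scene frame

def scene_complete(rgbd,
                   describe: Callable, segment: Callable,
                   inpaint: Callable, to_3d: Callable,
                   estimate_pose: Callable) -> List[ObjectModel]:
    """Compose pretrained modules: describe -> segment -> inpaint -> 3D -> pose."""
    models = []
    for label in describe(rgbd):                # vision-language: name visible objects
        mask = segment(rgbd, label)             # instance mask for this object
        completed = inpaint(rgbd, mask)         # fill occluded regions of the crop
        mesh = to_3d(completed)                 # full-object mesh in a canonical frame
        pose = estimate_pose(mesh, rgbd, mask)  # register the mesh back into the scene
        models.append(ObjectModel(label, mesh, pose))
    return models

# Toy stand-ins so the sketch runs end-to-end (real modules would be
# large pretrained networks, not lambdas):
demo = scene_complete(
    "rgbd",
    describe=lambda img: ["mug", "bowl"],
    segment=lambda img, lbl: f"mask:{lbl}",
    inpaint=lambda img, m: f"inpainted:{m}",
    to_3d=lambda crop: f"mesh:{crop}",
    estimate_pose=lambda mesh, img, m: (0.0, 0.0, 0.0),
)
```

The design point the abstract makes is that each stage is a general-purpose pretrained model, so the system composes off-the-shelf components rather than training a monolithic scene-reconstruction network.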