Object Pose and Shape Estimation for Grasping: Does it Work?

Object pose and shape estimation has seen key advances recently. Encoder-decoder models (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask: are object pose and shape estimation methods mature enough that, when combined with antipodal grasp sampling, they can outperform end-to-end grasp synthesis methods? We explore this question in detail, scoping our study to parallel-jaw grippers, 7-DoF grasps, and a single-view RGB(-D) image as input. We implement and compare a state-of-the-art end-to-end grasp synthesis method and three modular methods, which first estimate the pose and shape of every object in the scene and then generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods synthesize plenty of grasps even for small objects, where the end-to-end methods fail. The effectiveness of the modular methods is contingent on the accuracy of the pose and shape estimation, and it degrades partially in cluttered scenes, a limitation of existing pose and shape estimation methods. We also analyze the failure modes and run-times of the three modular methods, which use two different approaches to object pose and shape estimation: one based on an encoder-decoder model and the other on a diffusion model. Finally, we demonstrate that single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from just a single-view RGB-D image as input. We observe performance comparable to the state-of-the-art LERF-TOGO baseline.
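To make the antipodal sampling step concrete, the sketch below shows one common form of the antipodal test for a parallel-jaw gripper: two surface contacts are accepted when the line joining them lies inside the friction cone at both contacts and the contact distance fits within the gripper opening. It is an illustration only, not the implementation used in this work. The function name `antipodal_pairs`, the parameters `max_width` and `friction_coeff`, and the assumption that the estimated shape has already been reduced to surface points with outward unit normals are all hypothetical.

```python
import math
from itertools import combinations


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def antipodal_pairs(points, normals, max_width, friction_coeff=0.5):
    """Return index pairs (i, j) that pass a friction-cone antipodal test.

    points  -- surface points of the estimated shape (assumed input)
    normals -- outward unit normals, one per point (assumed input)
    max_width -- maximum gripper opening, same units as points (assumed)
    friction_coeff -- Coulomb friction coefficient (assumed value)
    """
    # The friction cone half-angle is atan(mu); compare cosines to avoid acos.
    cos_limit = math.cos(math.atan(friction_coeff))
    pairs = []
    for i, j in combinations(range(len(points)), 2):
        axis = _sub(points[j], points[i])
        width = math.sqrt(_dot(axis, axis))
        if width == 0.0 or width > max_width:
            continue  # coincident contacts, or too wide for the gripper
        axis = tuple(x / width for x in axis)
        # The finger at contact i pushes along +axis, so the inward normal
        # (-n_i) must lie within the cone around +axis. The finger at
        # contact j pushes along -axis, so n_j must align with +axis.
        if (-_dot(normals[i], axis) >= cos_limit
                and _dot(normals[j], axis) >= cos_limit):
            pairs.append((i, j))
    return pairs


# Usage: four contacts on the sides of a 4 cm wide box (units in meters).
pts = [(-0.02, 0.0, 0.0), (0.02, 0.0, 0.0), (0.0, 0.02, 0.0), (0.0, -0.02, 0.0)]
nrm = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
grasps = antipodal_pairs(pts, nrm, max_width=0.08)
```

In this toy case only the two pairs of opposite faces pass the test; pairs of adjacent faces are rejected because the grasp axis falls outside the friction cone. The exhaustive pair loop is quadratic in the number of points, so a practical sampler would likely draw random contacts or cast rays along the inward normal instead.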
