Vision-based grasping of unknown objects in unstructured environments is a key challenge for autonomous robotic manipulation. A practical grasp synthesis system must generate a diverse set of 6-DoF grasps from which a task-relevant grasp can be executed. Although generative models are well suited to learning such complex data distributions, existing models are limited in grasp quality, slow to train, and inflexible for task-specific generation. In this work, we present GraspLDM, a modular generative framework for 6-DoF grasp synthesis that uses diffusion models as priors in the latent space of a VAE. GraspLDM learns a generative model of object-centric $SE(3)$ grasp poses conditioned on point clouds. Its architecture lets us train task-specific models efficiently by re-training only a small denoising network in the low-dimensional latent space, whereas existing models require expensive re-training of the full model. Our framework provides robust and scalable models on both full and single-view point clouds. GraspLDM models trained on simulation data transfer well to the real world, achieving an 80\% success rate over 80 grasp attempts on diverse test objects and improving over existing generative models. We make our implementation available at https://github.com/kuldeepbrd1/graspldm.
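To make the idea of a diffusion prior in a VAE latent space concrete, the following is a minimal sketch of conditional DDPM ancestral sampling followed by decoding to a 6-DoF pose. It is not the GraspLDM implementation: `encode_point_cloud`, `toy_denoiser`, `decode_grasp`, the latent size, the step count, and the linear noise schedule are all hypothetical stand-ins chosen for illustration. The toy denoiser is the exact noise predictor for the degenerate case in which every clean latent equals the conditioning vector, so it returns a single mode, whereas a trained network would produce diverse grasps.

```python
import math
import random

LATENT_DIM = 4   # hypothetical size; the abstract only says "low-dimensional"
NUM_STEPS = 50   # hypothetical number of diffusion steps


def make_schedule(num_steps=NUM_STEPS, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule: returns betas and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def encode_point_cloud(points):
    """Hypothetical stand-in for a point-cloud encoder: centroid plus mean spread."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    spread = sum(math.dist(p, centroid) for p in points) / n
    return centroid + [spread]  # length LATENT_DIM


def toy_denoiser(z_t, alpha_bar_t, cond):
    """Stand-in for the learned noise predictor (an assumption, not a trained model).

    It is exact when every clean latent equals `cond`, so sampling recovers `cond`.
    """
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [(z - math.sqrt(alpha_bar_t) * c) / scale for z, c in zip(z_t, cond)]


def sample_latent(cond, rng):
    """DDPM ancestral sampling in the latent space, conditioned on `cond`."""
    betas, alpha_bars = make_schedule()
    z = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = toy_denoiser(z, alpha_bars[t], cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        z = [(zi - coef * e) / math.sqrt(1.0 - betas[t])
             for zi, e in zip(z, eps_hat)]
        if t > 0:  # no noise is added at the final step
            z = [zi + math.sqrt(betas[t]) * rng.gauss(0.0, 1.0) for zi in z]
    return z


def decode_grasp(z, weights):
    """Hypothetical linear decoder: latent -> translation (3) + unit quaternion (4)."""
    raw = [sum(zi * w for zi, w in zip(z, row)) for row in weights]
    norm = math.sqrt(sum(q * q for q in raw[3:]))
    return raw[:3], [q / norm for q in raw[3:]]


rng = random.Random(0)
points = [[rng.uniform(0.0, 0.2) for _ in range(3)] for _ in range(256)]
weights = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(7)]

cond = encode_point_cloud(points)
latent = sample_latent(cond, rng)
translation, quaternion = decode_grasp(latent, weights)
```

In this sketch the iterative sampling loop only ever touches the small latent vector, which illustrates why swapping or re-training the denoiser alone can be cheap compared with re-training an encoder and decoder that operate on point clouds and poses.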