Synthesizing diverse and accurate grasps with multi-fingered hands is an important yet challenging task in robotics. Previous efforts focusing on generative modeling have fallen short of precisely capturing the multi-modal, high-dimensional grasp distribution. To address this, we propose a Deep Generative Model (DGM) based on Normalizing Flows (NFs), an expressive class of models for learning complex probability distributions. Specifically, we first observe an encouraging improvement in diversity when directly applying a single conditional NF (cNF), dubbed FFHFlow-cnf, to learn a grasp distribution conditioned on an incomplete point cloud. However, the performance gains remain limited because of restricted expressivity in the latent space. This motivated us to develop a novel flow-based Deep Latent Variable Model (DLVM), namely FFHFlow-lvm, which yields more reasonable latent features and leads to both diverse and accurate grasp synthesis for unseen objects. Unlike Variational Autoencoders (VAEs), the proposed DLVM counteracts typical pitfalls such as mode collapse and mis-specified priors by using two cNFs, one for the prior and one for the likelihood, two distributions that in VAEs are usually restricted to isotropic Gaussians. Comprehensive experiments in simulation and real-robot scenarios demonstrate that our method generates more accurate and diverse grasps than the VAE baselines. Additionally, a run-time comparison indicates its high potential for real-time applications.
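For readers unfamiliar with conditional normalizing flows, the change-of-variables idea behind them can be illustrated with a toy example. The sketch below is not the FFHFlow architecture: it assumes a single affine layer, a scalar context `c` standing in for point-cloud features, and a hypothetical linear conditioner with weights `w_mu` and `w_log_sigma`. All function and parameter names are illustrative only.

```python
import math
import random


def conditional_affine_flow_logpdf(x, c, w_mu, w_log_sigma):
    """Return log p(x | c) for a one-layer conditional affine flow (toy sketch)."""
    # Hypothetical conditioner: shift and log-scale are linear in the context c.
    mu = [wm * c for wm in w_mu]
    log_sigma = [ws * c for ws in w_log_sigma]
    # Forward map x -> z; z is scored under a standard normal base density.
    z = [(xi - m) * math.exp(-ls) for xi, m, ls in zip(x, mu, log_sigma)]
    log_base = sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)
    # Change of variables: log|det dz/dx| = -sum(log_sigma) for an affine map.
    log_det = -sum(log_sigma)
    return log_base + log_det


def sample_conditional_affine_flow(c, w_mu, w_log_sigma, rng):
    """Draw x ~ p(x | c) by pushing base noise through the inverse map z -> x."""
    return [
        wm * c + math.exp(ws * c) * rng.gauss(0.0, 1.0)
        for wm, ws in zip(w_mu, w_log_sigma)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    w_mu, w_log_sigma = [1.0, 2.0], [0.1, -0.1]
    x = sample_conditional_affine_flow(0.5, w_mu, w_log_sigma, rng)
    print(x, conditional_affine_flow_logpdf(x, 0.5, w_mu, w_log_sigma))
```

A practical flow would stack many such invertible layers and replace the linear conditioner with a learned network, but the exact log-likelihood is computed in the same way: base log-density plus the log-determinant of the Jacobian.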