Grasping objects with limited or no prior knowledge about them is a highly relevant skill in assistive robotics. In this general setting, however, it remains an open problem, especially under partial observability and for versatile grasping with multi-fingered hands. We present a novel, fast, and high-fidelity deep learning pipeline consisting of a shape completion module that operates on a single depth image, followed by a grasp predictor that operates on the predicted object shape. The shape completion network is based on VQDIF and predicts spatial occupancy values at arbitrary query points. As the grasp predictor, we use our two-stage architecture, which first generates hand poses using an autoregressive model and then regresses finger joint configurations for each pose. Critical factors turn out to be sufficient data realism and augmentation, as well as special attention to difficult cases during training. Experiments on a physical robot platform demonstrate successful grasping of a wide range of household objects based on a depth image from a single viewpoint. The whole pipeline is fast, taking only about 1 s in total: 0.7 s to complete the object's shape and 0.3 s to generate 1000 grasps.
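To illustrate what it means for a shape completion network to predict occupancy values at arbitrary query points, the following is a minimal sketch and not the authors' implementation. It assumes a toy analytic sphere in place of the VQDIF-based network; the names `toy_occupancy` and `query_grid`, the grid resolution, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def toy_occupancy(points, center=(0.0, 0.0, 0.0), radius=0.5):
    """Hypothetical stand-in for a learned occupancy network.

    Returns 1.0 for query points inside a sphere and 0.0 otherwise.
    A real network would return a predicted occupancy value per point.
    """
    distances = np.linalg.norm(points - np.asarray(center), axis=-1)
    return (distances <= radius).astype(np.float32)


def query_grid(occupancy_fn, resolution=16, extent=1.0, threshold=0.5):
    """Evaluate an occupancy function on a regular 3D grid.

    Returns the (N, 3) array of grid points whose occupancy exceeds
    the threshold, i.e. a coarse point set describing the completed shape.
    """
    axis = np.linspace(-extent, extent, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    points = grid.reshape(-1, 3)          # flatten to a list of query points
    occupancy = occupancy_fn(points)      # one occupancy value per point
    return points[occupancy > threshold]  # keep only the occupied points


occupied = query_grid(toy_occupancy)
```

Because the function can be evaluated at any point, the grid used here is only one option; a downstream module such as a grasp predictor could instead query exactly the locations it cares about.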