Humans excel at grasping objects through diverse and robust policies, many of which are so improbable that exploration-based learning methods rarely observe them and can hardly learn them. Inspired by the human learning process, we propose a method that extracts latent intents from demonstrations and exploits them to learn diverse and robust grasping policies through self-exploration. The resulting policy can grasp challenging objects in various environments with an off-the-shelf parallel gripper. The key component is a learned intention estimator, which maps the gripper pose and visual sensory input to a set of sub-intents covering the important phases of the grasping movement. These sub-intents can be used to build an intrinsic reward that guides policy learning. The learned policy generalizes zero-shot from simulation to the real world while remaining robust to states never encountered during training, to novel objects such as protractors and user manuals, and to novel environments such as a cluttered conveyor.
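The abstract does not specify how sub-intents are turned into an intrinsic reward. As an illustration only, the sketch below makes three assumptions. First, the intention estimator outputs a probability distribution over an ordered set of sub-intents. Second, the phase names in `SUB_INTENTS` are hypothetical. Third, the reward uses standard potential-based reward shaping with the expected phase index as the potential. This shaping choice is swapped in for illustration and is not necessarily the authors' formulation.

```python
from typing import Sequence

# Hypothetical ordered sub-intents covering phases of a grasp (assumption:
# the source does not name or count the sub-intents).
SUB_INTENTS = ("approach", "align", "close", "lift")


def expected_phase(probs: Sequence[float]) -> float:
    """Expected phase index under the estimator's sub-intent distribution.

    `probs` is assumed to hold one (possibly unnormalized) probability per
    entry of SUB_INTENTS, in the same order.
    """
    if len(probs) != len(SUB_INTENTS):
        raise ValueError("expected one probability per sub-intent")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(i * p for i, p in enumerate(probs)) / total


def intrinsic_reward(
    probs_t: Sequence[float],
    probs_next: Sequence[float],
    gamma: float = 0.99,
    scale: float = 1.0,
) -> float:
    """Potential-based shaping term F = scale * (gamma * phi(s') - phi(s)).

    phi is the expected phase index, so transitions that move the estimated
    intent toward later grasp phases are rewarded and regressions are penalized.
    """
    return scale * (gamma * expected_phase(probs_next) - expected_phase(probs_t))


if __name__ == "__main__":
    # Estimator moves from mostly "approach" to mostly "align".
    before = [0.9, 0.1, 0.0, 0.0]
    after = [0.1, 0.8, 0.1, 0.0]
    print(intrinsic_reward(before, after, gamma=1.0))
```

Potential-based shaping is a natural fit for this kind of sketch because it adds dense guidance without changing which policies are optimal for the task reward. In practice the shaping term would be added to whatever task reward (for example, grasp success) the learner already uses.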