This paper identifies and addresses the problems that arise when (reinforcement) learning-based controllers and state estimators are naively combined for robotic in-hand manipulation. Specifically, we tackle the challenging task of purely tactile, goal-conditioned, dexterous in-hand reorientation with the hand pointing downwards. Because the available sensing is limited, many control strategies that are feasible in simulation with full knowledge of the object's state do not allow for accurate state estimation. Hence, training the controller and the estimator separately and combining them only at test time leads to poor performance. We solve this problem by coupling the control policy to the state estimator during training in simulation. This approach leads to more robust state estimation and higher overall task performance while maintaining an interpretability advantage over end-to-end policy learning. With our GPU-accelerated implementation, learning from scratch takes a median training time of only 6.5 hours on a single, low-cost GPU. In simulation experiments with the DLR-Hand II and four significantly different object shapes, we provide an in-depth analysis of the performance of our approach. We demonstrate successful sim2real transfer by rotating the four objects to all 24 orientations in the $\pi/2$ discretization of SO(3), which has never been achieved for such a diverse set of shapes. Finally, our method can reorient a cube consecutively to nine goals (median), which was beyond the reach of previous methods in this challenging setting.
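To make the count of 24 goal orientations concrete, the following minimal sketch enumerates them. It assumes that the $\pi/2$ discretization of SO(3) refers to the rotations that map the coordinate axes onto coordinate axes (the rotation group of a cube), which are exactly the signed 3x3 permutation matrices with determinant +1. The function names `det3` and `cube_rotations` are illustrative and are not taken from the paper's implementation.

```python
from itertools import permutations, product


def det3(m):
    """Determinant of a 3x3 matrix given as nested tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def cube_rotations():
    """Enumerate all signed 3x3 permutation matrices with determinant +1.

    Assumption: these matrices represent the pi/2 discretization of SO(3).
    """
    rotations = []
    for perm in permutations(range(3)):           # which axis each row maps to
        for signs in product((1, -1), repeat=3):  # sign of each row's entry
            m = tuple(
                tuple(signs[r] if c == perm[r] else 0 for c in range(3))
                for r in range(3)
            )
            # Keep proper rotations only; determinant -1 would be a reflection.
            if det3(m) == 1:
                rotations.append(m)
    return rotations


if __name__ == "__main__":
    print(len(cube_rotations()))  # 24
```

Of the 48 signed permutation matrices, half are reflections, which leaves the 24 proper rotations used as goal orientations under this assumption.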