We consider how model-based solvers can be leveraged to guide training of a universal policy that controls from any feasible start state to any feasible goal in a contact-rich manipulation setting. While Reinforcement Learning (RL) has demonstrated its strength in such settings, it may struggle to sufficiently explore and discover complex manipulation strategies, especially under sparse rewards. Our approach rests on the idea that the feasible, likely-visited states during such manipulation lie on a lower-dimensional manifold, and that RL can be guided with a sampler from this manifold. We propose Sample-Guided RL, which uses model-based constraint solvers to efficiently sample feasible configurations (satisfying differentiable collision, contact, and force constraints) and leverages them to guide RL toward universal (goal-conditioned) manipulation policies. We study using this data directly to bias state visitation, as well as using black-box optimization of open-loop trajectories between random configurations to impose a state bias and, optionally, add a behavior cloning loss. In a minimalistic double-sphere manipulation setting, Sample-Guided RL discovers complex manipulation strategies and achieves high success rates in reaching any statically stable state. In a more challenging Panda arm setting, our approach achieves a substantial success rate where the baseline is near zero, and demonstrates a breadth of complex whole-body-contact manipulation strategies.
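To make the state-visitation bias concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): a stand-in "constraint solver" produces feasible configurations by rejection sampling, and episode resets are drawn from those samples with some probability instead of a default start state. All names (`sample_feasible_state`, `reset_with_state_bias`, `p_sampler`) are illustrative assumptions.

```python
import random

def sample_feasible_state(rng):
    # Stand-in for a model-based constraint solver: accept random points
    # inside the unit disc, mimicking rejection sampling against
    # collision/contact feasibility constraints.
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # "feasibility" check
            return (x, y)

def reset_with_state_bias(rng, p_sampler=0.5, default=(0.0, 0.0)):
    # With probability p_sampler, start the episode from a sampled
    # feasible state rather than the default start, biasing the states
    # the RL agent visits toward the feasible manifold.
    if rng.random() < p_sampler:
        return sample_feasible_state(rng)
    return default
```

In a full system, the rejection sampler would be replaced by the differentiable constraint solver, and the sampled states could additionally seed the black-box trajectory optimization and the optional behavior-cloning targets.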