Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies with deep reinforcement learning (RL) on commodity GPUs. However, such successes of RL in robotics have been limited to tasks that can be adequately simulated with fast rigid-body dynamics. Simulation techniques for soft bodies are, by comparison, several orders of magnitude slower, which limits the use of RL given its sample-complexity requirements. To address this challenge, this paper presents both a novel RL algorithm and a simulation platform that enable scaling RL to tasks involving rigid bodies and deformables. We introduce Soft Analytic Policy Optimization (SAPO), a maximum-entropy first-order model-based actor-critic RL algorithm that uses first-order analytic gradients from differentiable simulation to train a stochastic actor to maximize expected return and entropy. Alongside our approach, we develop Rewarped, a parallel differentiable multiphysics simulation platform that supports simulating various materials beyond rigid bodies. We re-implement challenging manipulation and locomotion tasks in Rewarped and show that SAPO outperforms baselines across a range of tasks involving interaction between rigid bodies, articulations, and deformables.
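The core idea behind SAPO, as described above, is to backpropagate analytic gradients through a differentiable simulator into a stochastic policy that is trained to maximize expected return plus an entropy bonus. The following is a minimal, hypothetical sketch of that idea on a one-step, one-dimensional "simulation" with a Gaussian policy; all names and the toy dynamics are illustrative assumptions, not the authors' implementation:

```python
import math
import random

random.seed(0)

def sapo_toy():
    """Toy sketch (assumed, not the paper's algorithm): differentiate
    through a one-step simulator x1 = x0 + a with reward r = -x1**2,
    and do gradient ascent on J = E[r] + alpha * H(policy) for a
    Gaussian policy a = mu + sigma * eps (reparameterization trick)."""
    x0 = 1.0                              # initial state of a 1-D point mass
    mu, log_sigma = 0.0, math.log(0.2)    # Gaussian policy parameters
    alpha, lr, batch = 0.01, 0.1, 64      # entropy weight, step size, samples

    for _ in range(200):
        g_mu = g_ls = 0.0
        sigma = math.exp(log_sigma)
        for _ in range(batch):
            eps = random.gauss(0.0, 1.0)
            a = mu + sigma * eps          # reparameterized stochastic action
            x1 = x0 + a                   # differentiable "simulation" step
            dr_da = -2.0 * x1             # analytic gradient through the sim
            g_mu += dr_da                 # chain rule: da/dmu = 1
            g_ls += dr_da * eps * sigma   # chain rule: da/dlog_sigma = eps*sigma
        g_mu /= batch
        g_ls = g_ls / batch + alpha       # dH/dlog_sigma = 1 for a Gaussian
        mu += lr * g_mu                   # gradient ascent on the objective
        log_sigma += lr * g_ls
    return mu, math.exp(log_sigma)

mu, sigma = sapo_toy()
# mu converges near -1.0, the action that drives the state to 0,
# while the entropy bonus keeps sigma bounded away from collapse.
```

The entropy term prevents the stochastic actor from collapsing to a deterministic policy, which is the role the maximum-entropy objective plays in SAPO; the analytic gradient `dr_da` stands in for the first-order gradients a differentiable simulator such as Rewarped would provide over full rollouts.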