The sample inefficiency of reinforcement learning (RL) remains a significant challenge in robotics. RL requires large-scale simulation and can still incur long training times, slowing research and innovation. This issue is particularly pronounced in vision-based control tasks, where reliable state estimates are not accessible. Differentiable simulation offers an alternative by enabling gradient back-propagation through the dynamics model, yielding low-variance analytical policy gradients and, hence, higher sample efficiency. However, its use for real-world robotic tasks has so far been limited. This work demonstrates the great potential of differentiable simulation for learning quadrotor control. We show that training in differentiable simulation significantly outperforms model-free RL in both sample efficiency and training time, allowing a policy to learn to recover a quadrotor in seconds when provided with vehicle states and in minutes when relying solely on visual features. The key to our success is two-fold. First, using a simple surrogate model for gradient computation greatly accelerates training without sacrificing control performance. Second, combining state representation learning with policy learning enhances convergence speed in tasks where only visual features are observable. These findings highlight the potential of differentiable simulation for real-world robotics and offer a compelling alternative to conventional RL approaches.
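The surrogate-gradient idea in the abstract can be illustrated with a toy sketch: below, a 1-D double integrator stands in for a simple surrogate model of the vehicle, and the analytical policy gradient is obtained by hand-rolled back-propagation through the rollout, exactly as a differentiable simulator would do automatically. The dynamics, gains, horizon, and step size here are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical minimal sketch: analytical policy gradients through a
# surrogate model. A 1-D double integrator replaces the full quadrotor
# dynamics; all constants below are illustrative assumptions.

DT = 0.02     # integration step [s]
HORIZON = 50  # rollout length (1 s of simulated time)

def rollout(kp, kd, p0, v0):
    """Forward pass: simulate the surrogate and record the state trajectory.

    Policy: a_t = -kp * p_t - kd * v_t (a tiny linear controller whose
    gains kp, kd are the learnable parameters).
    Loss: sum of squared position errors along the rollout.
    """
    ps, vs = [p0], [v0]
    loss = 0.0
    for _ in range(HORIZON):
        p, v = ps[-1], vs[-1]
        a = -kp * p - kd * v      # linear "policy"
        v = v + a * DT            # surrogate dynamics: double integrator
        p = p + v * DT
        ps.append(p)
        vs.append(v)
        loss += p * p             # quadratic cost on position error
    return loss, ps, vs

def policy_gradient(kp, kd, p0, v0):
    """Backward pass: analytic d(loss)/d(kp, kd) via reverse-mode BPTT."""
    _, ps, vs = rollout(kp, kd, p0, v0)
    gp = gv = 0.0                 # adjoints of the next-step state
    gkp = gkd = 0.0
    for t in reversed(range(HORIZON)):
        gp += 2.0 * ps[t + 1]     # d(cost_t)/d(p_{t+1})
        gv_tot = gv + gp * DT     # total adjoint of v_{t+1} (p_{t+1} uses it)
        ga = gv_tot * DT          # adjoint of the action a_t
        gkp += -ga * ps[t]        # a_t = -kp*p_t - kd*v_t
        gkd += -ga * vs[t]
        gp = gp - ga * kp         # adjoint of p_t
        gv = gv_tot - ga * kd     # adjoint of v_t
    return gkp, gkd

# Usage: plain gradient descent on the controller gains.
kp, kd = 1.0, 1.0
for _ in range(200):
    gkp, gkd = policy_gradient(kp, kd, 1.0, 0.0)
    kp -= 1e-3 * gkp
    kd -= 1e-3 * gkd
```

The key property this sketch shows is that the gradient is analytic and low-variance: a single rollout yields an exact derivative of the loss with respect to the policy parameters, whereas a model-free estimator would need many sampled trajectories for a comparably accurate update.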