Reinforcement learning (RL) and trajectory optimization (TO) have strongly complementary strengths. On one hand, RL approaches can learn global control policies directly from data, but generally require large numbers of samples to converge to feasible policies. On the other hand, TO methods can exploit gradient information extracted from simulators to converge quickly to a locally optimal control trajectory, which is valid only in the vicinity of the solution. Over the past decade, several approaches have sought to combine the two classes of methods to obtain the best of both worlds. Continuing this line of research, we propose several improvements to these approaches that learn global control policies more quickly, notably by leveraging sensitivity information from TO methods via Sobolev learning, and by using augmented Lagrangian techniques to enforce consensus between TO and policy learning. We evaluate the benefits of these improvements on various classical robotics tasks through comparison with existing approaches in the literature.