An open research question in robotics is how to combine the benefits of model-free reinforcement learning (RL), known for its strong task performance and flexibility in optimizing general reward formulations, with the robustness and online replanning capabilities of model predictive control (MPC). This paper provides an answer by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an actor-critic RL framework. The proposed approach combines the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy manages short-term decisions through the MPC-based actor and long-term prediction through the critic network, unifying the benefits of model-based control and end-to-end learning. We validate our method in both simulation and the real world with a quadcopter platform across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior.