Despite its success, Model Predictive Control (MPC) often requires intensive task-specific engineering and tuning. Reinforcement Learning (RL) architectures, on the other hand, minimize this effort but need extensive data collection and lack interpretability and safety. How to combine the advantages of RL and MPC remains an open research question. This paper introduces a novel modular RL architecture that bridges these two approaches. By placing a differentiable MPC at the heart of an actor-critic RL agent, the proposed system enables short-term prediction and optimization of actions based on system dynamics, while retaining the end-to-end training benefits and exploratory behavior of an RL agent. The proposed approach handles two different time-horizon scales: short-term decisions are managed by the actor MPC and long-term ones by the critic network. This offers a promising direction for RL that combines the advantages of model-based and end-to-end learning methods. We validate the approach in simulated and real-world experiments on a quadcopter platform performing different high-level tasks, and show that the proposed method can learn complex behaviors end-to-end while retaining the properties of an MPC.
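To make the idea of an MPC actor with learnable parameters concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical scalar linear system x' = a*x + b*u with a quadratic cost, solves the short-horizon problem by a backward Riccati recursion, and treats the state cost weight `q` as the parameter a learning signal could adjust. All names and numeric values (`mpc_first_action`, `action_sensitivity`, `a`, `b`, `r`, `horizon`) are illustrative assumptions. The sensitivity is estimated by finite differences as a stand-in for differentiating through the solver analytically, and the critic is not shown.

```python
def mpc_first_action(x, q, r=0.1, a=1.0, b=0.5, horizon=10):
    """First action of a finite-horizon LQR problem (scalar, illustrative).

    Dynamics (assumed): x_next = a * x + b * u
    Stage cost (assumed): q * x**2 + r * u**2, terminal cost q * x**2
    """
    p = q    # terminal cost-to-go weight
    k = 0.0  # feedback gain, overwritten at every backward step
    for _ in range(horizon):
        # Gain for this step, computed from the cost-to-go of the next step.
        k = (a * b * p) / (r + b * b * p)
        # Scalar Riccati update of the cost-to-go weight.
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    # After the loop, k is the gain for the first step of the horizon.
    return -k * x


def action_sensitivity(x, q, eps=1e-5):
    """Central finite-difference estimate of d(action)/d(q).

    A stand-in for analytic differentiation through the MPC solver.
    """
    up = mpc_first_action(x, q + eps)
    down = mpc_first_action(x, q - eps)
    return (up - down) / (2.0 * eps)


if __name__ == "__main__":
    state = 1.0
    cost_weight = 1.0
    print("action:", mpc_first_action(state, cost_weight))
    print("d(action)/d(q):", action_sensitivity(state, cost_weight))
```

In an actor-critic setting, a gradient of this kind is what would allow a training signal derived from the critic to shape the cost the MPC optimizes, while the MPC itself still computes each action from the system dynamics.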