In this paper we describe a new conceptual framework that connects approximate Dynamic Programming (DP), Model Predictive Control (MPC), and Reinforcement Learning (RL). This framework centers on two algorithms, which are designed largely independently of each other and operate in synergy through the powerful mechanism of Newton's method. We call them the off-line training and the on-line play algorithms. The names are borrowed from some of the major successes of RL involving games; primary examples are the recent (2017) AlphaZero program (which plays chess, [SHS17], [SSS17]), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon, [Tes94], [Tes95], [TeG96]). In these game contexts, the off-line training algorithm is the method used to teach the program how to evaluate positions and to generate good moves at any given position, while the on-line play algorithm is the method used to play in real time against human or computer opponents. Significantly, the synergy between off-line training and on-line play also underlies MPC (as well as other major classes of sequential decision problems), and indeed the MPC design architecture is very similar to that of AlphaZero and TD-Gammon. This conceptual insight provides a vehicle for bridging the cultural gap between RL and MPC, and sheds new light on some fundamental issues in MPC. These include the enhancement of stability properties through rollout, the treatment of uncertainty through the use of certainty equivalence, the resilience of MPC in adaptive control settings that involve changing system parameters, and the insights provided by the superlinear performance bounds implied by Newton's method.
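To make the off-line training/on-line play division concrete, the following minimal Python sketch (our illustration, not code from AlphaZero, TD-Gammon, or any MPC package) pairs an off-line phase that computes a cost-to-go approximation with an on-line phase that selects controls by one-step lookahead minimization using that approximation. The shortest-path problem, function names, and parameters are all hypothetical.

```python
# Minimal sketch of the two-algorithm architecture (illustrative only):
# an off-line training phase produces a cost-to-go approximation J_tilde,
# and an on-line play phase applies one-step lookahead using J_tilde.

# Toy deterministic problem: states 0..N, goal is state 0.
N = 10
CONTROLS = (-1, +1)          # move left or right

def f(x, u):                 # system dynamics x_{k+1} = f(x_k, u_k)
    return min(max(x + u, 0), N)

def g(x, u):                 # stage cost: one unit per step until the goal
    return 0.0 if x == 0 else 1.0

# --- Off-line training -------------------------------------------
# Here a few value iteration sweeps stand in for training; in
# AlphaZero/TD-Gammon the analogous step trains a neural network
# position evaluator.
def offline_training(sweeps=5):
    J = [float(x) for x in range(N + 1)]      # crude initial guess
    for _ in range(sweeps):
        J = [0.0 if x == 0 else
             min(g(x, u) + J[f(x, u)] for u in CONTROLS)
             for x in range(N + 1)]
    return J

# --- On-line play ------------------------------------------------
# One-step lookahead minimization: the step that, in the paper's
# viewpoint, acts as a Newton iteration applied at J_tilde.
def online_play(x, J_tilde):
    return min(CONTROLS, key=lambda u: g(x, u) + J_tilde[f(x, u)])

if __name__ == "__main__":
    J_tilde = offline_training()
    x = 7
    while x != 0:             # play in real time from state 7
        u = online_play(x, J_tilde)
        print(f"state {x}, control {u}")
        x = f(x, u)
```

In this reading, the quality of the off-line product J_tilde sets the starting point of the on-line Newton step, which is why (per the paper's thesis) even an approximate off-line evaluation can yield strong on-line performance once lookahead is applied.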