Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model-free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model-based TD-learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high-dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available at https://github.com/wertyuilife2/bmpc.
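The bootstrapped loop the abstract describes can be illustrated with a toy sketch. Everything below is a hypothetical, simplified illustration, not the paper's implementation: a scalar state, a known toy world model, a linear value function, and a sampling-based planner stand in for the learned world model, value network, and policy network of BMPC; the lazy reanalyze mechanism is omitted. The sketch shows the three interacting pieces: MPC proposals are drawn around the current policy (the policy guides planning), the policy is regressed toward the MPC expert's action (imitation), and the value is updated by a one-step model-based TD target.

```python
import numpy as np

# Toy sketch of a bootstrapped MPC/policy/value loop (all names hypothetical).
# State: scalar x; goal: drive x toward 0. The "world model" is known here.

rng = np.random.default_rng(0)

def model(x, a):
    """Toy dynamics: next state and reward (negative distance to origin)."""
    x_next = x + a
    return x_next, -abs(x_next)

def value(x, w):
    """Linear value approximation V(x) = -w * |x|."""
    return -w * abs(x)

def mpc_plan(x, w, policy_a, horizon=3, n_samples=64, sigma=0.3):
    """Sample action sequences around the policy prior; score each rollout
    with model rewards plus a learned-value bootstrap at the horizon."""
    best_a, best_ret = 0.0, -np.inf
    for _ in range(n_samples):
        xs, ret, first = x, 0.0, None
        for _ in range(horizon):
            a = policy_a * xs + sigma * rng.standard_normal()  # policy-guided proposal
            if first is None:
                first = a
            xs, r = model(xs, a)
            ret += r
        ret += value(xs, w)  # bootstrap the tail with the learned value
        if ret > best_ret:
            best_ret, best_a = ret, first
    return best_a

# Bootstrapped training loop: the policy imitates the MPC expert,
# and the value is updated by model-based TD.
policy_a, w = 0.0, 0.5  # linear policy a = policy_a * x; value weight w
for step in range(200):
    x = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 2.0)
    a_expert = mpc_plan(x, w, policy_a)
    # Imitation: move the policy slope 5% toward the expert's implied slope.
    policy_a += 0.05 * (a_expert / x - policy_a)
    # Model-based TD: one-step target through the model, then a gradient
    # step on (V(x) - target)^2 with dV/dw = -|x|.
    x_next, r = model(x, a_expert)
    td_target = r + 0.9 * value(x_next, w)
    w -= 0.05 * (td_target - value(x, w)) * abs(x)
```

After training, `policy_a` should be close to -1 (the action that sends the toy state straight to the origin), showing how imitating the planner distills its behavior into a cheap reactive policy that in turn sharpens subsequent planning.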