Model Predictive Control (MPC) provides interpretable, tunable locomotion controllers grounded in physical models, but its robustness depends on frequent replanning and is limited by model mismatch and real-time computational constraints. Reinforcement Learning (RL), by contrast, can produce highly robust behaviors through stochastic training but often lacks interpretability, suffers from out-of-distribution failures, and requires intensive reward engineering. This work presents a GPU-parallelized residual architecture that tightly integrates MPC and RL by blending their outputs at the torque-control level. We develop a kinodynamic whole-body MPC formulation evaluated across thousands of agents in parallel at 100 Hz for RL training. The residual policy learns to make targeted corrections to the MPC outputs, combining the interpretability and constraint handling of model-based control with the adaptability of RL. The model-based control prior acts as a strong bias, initializing and guiding the policy towards desirable behavior with a simple set of rewards. Compared to standalone MPC or end-to-end RL, our approach achieves higher sample efficiency, converges to a higher asymptotic reward, expands the range of trackable velocity commands, and enables zero-shot adaptation to unseen gaits and uneven terrain.
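The torque-level blending described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the zero-initialized policy, and the residual clipping bound are all assumptions introduced here for clarity.

```python
import numpy as np

def blended_torque(tau_mpc, residual_policy, obs, clip=5.0):
    """Combine MPC torques with a learned residual correction (illustrative sketch).

    tau_mpc         -- torque vector produced by the whole-body MPC solver
    residual_policy -- callable mapping an observation to a residual torque
    clip            -- bound on the residual (assumed here) so the MPC prior
                       dominates early in training
    """
    delta = np.clip(residual_policy(obs), -clip, clip)
    return tau_mpc + delta

# Toy usage: a zero-initialized "policy" reproduces pure MPC behavior,
# which is how the model-based prior can act as a strong initial bias.
tau_mpc = np.array([1.2, -0.4, 0.8])
policy = lambda obs: np.zeros_like(tau_mpc)
tau = blended_torque(tau_mpc, policy, obs=None)
# tau equals tau_mpc here, since the residual is zero
```

Under this composition, RL training only has to learn a bounded correction on top of an already-reasonable controller, which is consistent with the sample-efficiency gains the abstract reports.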