Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, and manually designing such priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, together with a learned world model used for online rollouts. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon both MPPI and end-to-end RL baselines while maintaining real-time control.
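To make the online stage concrete, the following is a minimal sketch of a single MPPI update that samples control sequences around a supplied prior mean. It is an illustration under stated assumptions, not the HOLO-MPPI implementation: the 1-D point-mass dynamics, the quadratic cost, and the names `rollout_cost`, `mppi_step`, `prior_mean`, `sigma`, and `lam` are all hypothetical, and a fixed array stands in for the output of the learned high-level policy and world model.

```python
import numpy as np


def rollout_cost(x0, controls, goal, dt=0.1):
    """Accumulated cost of K control sequences on a toy 1-D point mass.

    Assumption: hand-written dynamics and cost stand in for a learned world model.
    x0 is (position, velocity); controls has shape (K, T) and holds accelerations.
    """
    num_samples, horizon = controls.shape
    pos = np.full(num_samples, x0[0], dtype=float)
    vel = np.full(num_samples, x0[1], dtype=float)
    cost = np.zeros(num_samples)
    for t in range(horizon):
        vel = vel + controls[:, t] * dt
        pos = pos + vel * dt
        # Distance-to-goal penalty plus a small control-effort penalty.
        cost += (pos - goal) ** 2 + 1e-3 * controls[:, t] ** 2
    return cost


def mppi_step(x0, prior_mean, goal, sigma=0.5, lam=1.0, num_samples=256, rng=None):
    """One gradient-free MPPI refinement around a prior control sequence.

    prior_mean has shape (T,). Here it is a placeholder for a policy-generated prior.
    """
    rng = np.random.default_rng() if rng is None else rng
    horizon = prior_mean.shape[0]
    # Sample Gaussian perturbations centred on the prior.
    noise = rng.normal(0.0, sigma, size=(num_samples, horizon))
    candidates = prior_mean[None, :] + noise
    costs = rollout_cost(x0, candidates, goal)
    # Exponentiated-cost weights; subtracting the minimum keeps exp() stable.
    weights = np.exp(-(costs - costs.min()) / lam)
    weights /= weights.sum()
    # Weighted average of the perturbations updates the prior.
    return prior_mean + weights @ noise


if __name__ == "__main__":
    x0 = np.array([0.0, 0.0])
    prior = np.zeros(20)  # deliberately uninformative prior
    refined = mppi_step(x0, prior, goal=5.0, rng=np.random.default_rng(0))
    print("prior cost:  ", rollout_cost(x0, prior[None, :], 5.0)[0])
    print("refined cost:", rollout_cost(x0, refined[None, :], 5.0)[0])
```

In this toy setting the quality of `prior_mean` determines how many samples land in low-cost regions, which is the dependence on the sampling prior that the abstract describes.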