This paper addresses the problem of training a reinforcement learning (RL) policy under partial observability by exploiting a privileged, anytime-feasible planner agent available exclusively during training. We formalize this as a Partially Observable Markov Decision Process (POMDP) in which a planner agent with access to an approximate dynamical model and privileged state information guides a learning agent that observes only a lossy projection of the true state. To realize this framework, we introduce an anytime-feasible Model Predictive Control (MPC) algorithm that serves as the planner agent. For the learning agent, we propose Planner-to-Policy Soft Actor-Critic (P2P-SAC), a method that distills the planner agent's privileged knowledge to mitigate partial observability and thereby improve both sample efficiency and final policy performance. We support this framework with rigorous theoretical analysis. Finally, we validate our approach in simulation using NVIDIA Isaac Lab and successfully deploy it on a real-world Unitree Go2 quadruped navigating complex, obstacle-rich environments.