Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, long planning horizons, or sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, the actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset may have been generated by a policy with a different structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO, a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the \textit{inverse problem}: recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both when policies trained offline are deployed directly and when online fine-tuning is performed. Code and data are available at https://ohio-offline-hierarchical-rl.github.io
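To make the inverse-problem idea concrete, the following is a minimal sketch, not the paper's actual method: it assumes the deployed low-level policy is a known, analytically invertible proportional controller $a = K(g - s)$, where $g$ is the unobserved high-level action (a goal). All names here (`K`, `recover_goal`, `relabel_dataset`) are illustrative.

```python
import numpy as np

K = 2.0  # assumed known gain of the low-level controller (illustrative)

def low_level(state, goal):
    """Known low-level controller: drives the state toward the goal."""
    return K * (goal - state)

def recover_goal(state, action):
    """Inverse problem: the high-level goal that would have produced
    the observed low-level action under the controller above."""
    return state + action / K

def relabel_dataset(transitions):
    """Replace observed low-level actions with recovered high-level
    actions, yielding (s, g, r, s') tuples suitable for training a
    high-level policy with off-the-shelf offline RL."""
    return [(s, recover_goal(s, a), r, s2) for (s, a, r, s2) in transitions]

# Toy consistency check: the recovered goal reproduces the observed action.
s, g = np.array([0.5]), np.array([1.0])
a = low_level(s, g)
assert np.allclose(recover_goal(s, a), g)
```

In more realistic instantiations the low-level controller is not invertible in closed form, and the high-level action would instead be recovered numerically, e.g., by optimization over candidate goals.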