Hierarchical reinforcement learning (RL) can accelerate long-horizon decision-making by temporally abstracting a policy into multiple levels. Skills, i.e., sequences of primitive actions, have shown promising results in sparse-reward environments. Typically, a skill latent space and policy are discovered from offline data, but the resulting low-level policy can be unreliable because of low-coverage demonstrations or distribution shift. As a solution, we propose fine-tuning the low-level policy in conjunction with high-level skill selection. Our Skill-Critic algorithm optimizes both the low- and high-level policies; both are initialized and regularized by the latent space learned from offline demonstrations, which guides the joint policy optimization. We validate our approach in multiple sparse-reward RL environments, including a new sparse-reward autonomous racing task in Gran Turismo Sport. The experiments show that Skill-Critic's low-level policy fine-tuning and demonstration-guided regularization are essential for optimal performance. Images and videos are available at https://sites.google.com/view/skill-critic. We plan to open-source the code with the final version.
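To make the two-level structure in the abstract concrete, the following is a minimal sketch of temporal abstraction with skills: a high-level policy selects a skill once every few steps, and a low-level policy turns that skill into primitive actions. It is not the Skill-Critic implementation. The toy environment `ChainEnv`, the skill length `HORIZON`, the two hand-written skills, and the function names are all hypothetical, and nothing here is learned or regularized from offline data.

```python
from dataclasses import dataclass

HORIZON = 3  # hypothetical skill length: primitive steps per high-level decision


@dataclass
class ChainEnv:
    """Toy 1-D chain with a sparse reward at the goal (illustrative only)."""
    goal: int = 6
    pos: int = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Move left (-1) or right (+1); position is clipped at 0.
        self.pos = max(0, self.pos + action)
        done = self.pos >= self.goal
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return self.pos, reward, done


def low_level_policy(state, skill):
    """Map (state, skill) to a primitive action.

    Hand-written stand-in: skill 0 moves left, any other skill moves right.
    """
    return -1 if skill == 0 else 1


def rollout(env, select_skill, max_skills=50):
    """Run one episode; `select_skill(state)` plays the high-level policy."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_skills):
        skill = select_skill(state)  # one high-level decision per skill
        for _ in range(HORIZON):     # low level acts for up to HORIZON steps
            action = low_level_policy(state, skill)
            state, reward, done = env.step(action)
            total_reward += reward
            steps += 1
            if done:
                return total_reward, steps
    return total_reward, steps
```

For example, `rollout(ChainEnv(), lambda state: 1)` always selects the "move right" skill, so the high level makes only two decisions while the low level takes six primitive steps. Shortening the high-level decision horizon in this way is what makes sparse rewards easier to reach.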