Hierarchical reinforcement learning (HRL) has the potential to solve complex long-horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-phase approach that first performs adaptive relabeling on a few expert demonstrations to generate efficient subgoal supervision, and then jointly optimizes HRL agents using reinforcement learning (RL) and imitation learning (IL). We perform theoretical analysis to $(i)$ bound the sub-optimality of our approach, and $(ii)$ derive a generalized plug-and-play framework for joint optimization using RL and IL. PEAR requires only a handful of expert demonstrations and makes minimal limiting assumptions on the task structure. Additionally, it can be easily integrated with typical model-free RL algorithms to produce a practical HRL algorithm. We perform experiments on challenging robotic environments and show that PEAR is able to solve tasks that require long-term decision making. We empirically show that PEAR exhibits improved performance and sample efficiency over previous hierarchical and non-hierarchical approaches. We also perform real-world robotic experiments on complex tasks and demonstrate that PEAR consistently outperforms the baselines.