Hierarchical reinforcement learning (HRL) has the potential to solve complex long-horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-phase approach in which we first perform adaptive relabeling on a few expert demonstrations to generate efficient subgoal supervision, and then jointly optimize HRL agents using reinforcement learning (RL) and imitation learning (IL). We perform theoretical analysis to (i) bound the sub-optimality of our approach, and (ii) derive a generalized plug-and-play framework for joint optimization using RL and IL. PEAR uses a handful of expert demonstrations and makes minimal limiting assumptions on the task structure. Additionally, it can be easily integrated with typical model-free RL algorithms to produce a practical HRL algorithm. We perform experiments on challenging robotic environments and show that PEAR solves tasks that require long-term decision making. We empirically show that PEAR exhibits improved performance and sample efficiency over previous hierarchical and non-hierarchical approaches. We also perform real-world robotic experiments on complex tasks and demonstrate that PEAR consistently outperforms the baselines.