Goal-conditioned hierarchical reinforcement learning (GCHRL) decomposes long-horizon tasks into sub-tasks through a hierarchical framework and has shown promising results across a variety of domains. However, the action space of the high-level policy is often excessively large, which makes effective exploration difficult and can lead to inefficient training. Moreover, the low-level policy changes during training, which makes the high-level state transition function non-stationary and significantly impedes the learning of the high-level policy. In this paper, we design a measure of prospect for subgoals by planning in the goal space based on the goal-conditioned value function. Building on this measure, we propose a landmark-guided exploration strategy that integrates the measures of prospect and novelty to guide the agent toward efficient exploration and improve sample efficiency. To address the non-stationarity caused by the changing low-level policy, we apply a state-specific regularization to the learning of the low-level policy, which facilitates stable learning of the hierarchical policy. Experimental results demonstrate that the proposed exploration strategy significantly outperforms the baseline methods across multiple tasks.
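To make the idea of combining prospect and novelty concrete, the following is a minimal sketch and not the method defined in the paper. It assumes, purely for illustration, that prospect can be approximated by the negative Euclidean distance from a candidate subgoal to a landmark chosen by planning, that novelty is a simple count-based bonus, and that the two are mixed by a hypothetical weight `alpha`. All function names and parameters here are invented for this example.

```python
import numpy as np


def normalize(x, eps=1e-8):
    """Rescale a 1-D array to roughly [0, 1] so the two measures are comparable."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)


def prospect(candidates, landmark):
    """Assumed stand-in for prospect: candidates closer to the landmark score higher."""
    return -np.linalg.norm(candidates - landmark, axis=1)


def novelty(visit_counts):
    """Assumed count-based novelty: rarely visited candidates score higher."""
    return 1.0 / np.sqrt(1.0 + np.asarray(visit_counts, dtype=float))


def select_subgoal(candidates, landmark, visit_counts, alpha=0.5):
    """Pick the candidate subgoal with the best weighted mix of prospect and novelty.

    alpha is a hypothetical trade-off weight: 1.0 uses prospect only,
    0.0 uses novelty only.
    """
    score = (alpha * normalize(prospect(candidates, landmark))
             + (1.0 - alpha) * normalize(novelty(visit_counts)))
    return int(np.argmax(score)), score


# Three candidate subgoals in a 2-D goal space, one landmark, and visit counts.
candidates = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
landmark = np.array([2.0, 2.0])
visit_counts = np.array([0, 5, 5])

best_by_prospect, _ = select_subgoal(candidates, landmark, visit_counts, alpha=1.0)
best_by_novelty, _ = select_subgoal(candidates, landmark, visit_counts, alpha=0.0)
```

With `alpha=1.0` the sketch selects the candidate nearest the landmark, and with `alpha=0.0` it selects the least-visited candidate. In a learned setting, the distance term would presumably be replaced by an estimate derived from the goal-conditioned value function.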