Hierarchical reinforcement learning (HRL) offers a promising approach for intelligent agents facing complex tasks with sparse rewards: it uses a hierarchical framework that divides a task into subgoals and completes them sequentially. However, current methods struggle to find subgoals that keep the learning process stable. Without additional guidance, relying solely on exploration or heuristic methods to determine subgoals in a large goal space is impractical. To address this issue, we propose MENTOR, a general hierarchical reinforcement learning framework that incorporates human feedback and dynamic distance constraints. MENTOR acts as a "mentor", incorporating human feedback into high-level policy learning to find better subgoals. For the low-level policy, MENTOR uses a dual policy that decouples exploration from exploitation to stabilize training. Furthermore, although humans can readily break tasks down into subgoals that point learning in the right direction, subgoals that are too difficult or too easy can still hinder downstream learning efficiency. We therefore propose the Dynamic Distance Constraint (DDC) mechanism, which dynamically adjusts the space of candidate subgoals, so that MENTOR generates subgoals matched to the low-level policy's learning progress, from easy to hard. Extensive experiments demonstrate that MENTOR achieves significant improvement on complex tasks with sparse rewards using only a small amount of human feedback.
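The idea of restricting candidate subgoals by a distance threshold that grows as the low-level policy improves can be illustrated with a minimal sketch. The abstract does not specify how DDC is implemented, so everything below is an assumption made for illustration: the class `DistanceConstraint`, the function `pick_subgoal`, the success-rate thresholds, the fixed step size, and the use of Euclidean distance as a stand-in for whatever distance measure MENTOR actually uses. The `score` callable is likewise a hypothetical placeholder for a preference signal learned from human feedback.

```python
import math


def euclidean(a, b):
    """Euclidean distance; a stand-in (assumption) for the framework's distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class DistanceConstraint:
    """Hypothetical threshold k on how far a subgoal may lie from the current state."""

    def __init__(self, k=1.0, k_min=0.5, k_max=10.0, step=0.5,
                 raise_at=0.8, lower_at=0.3):
        self.k = k
        self.k_min = k_min
        self.k_max = k_max
        self.step = step
        self.raise_at = raise_at    # assumed success rate above which subgoals get harder
        self.lower_at = lower_at    # assumed success rate below which subgoals get easier

    def update(self, success_rate):
        """Widen k when the low-level policy succeeds often, shrink it when it fails often."""
        if success_rate >= self.raise_at:
            self.k = min(self.k + self.step, self.k_max)
        elif success_rate <= self.lower_at:
            self.k = max(self.k - self.step, self.k_min)
        return self.k

    def feasible(self, state, candidates):
        """Keep only the candidate subgoals within distance k of the current state."""
        return [g for g in candidates if euclidean(state, g) <= self.k]


def pick_subgoal(state, candidates, constraint, score):
    """Return the best-scoring subgoal among those allowed by the constraint.

    If no candidate satisfies the constraint, fall back to the nearest one.
    """
    allowed = constraint.feasible(state, candidates)
    if not allowed:
        return min(candidates, key=lambda g: euclidean(state, g))
    return max(allowed, key=score)


if __name__ == "__main__":
    state, goal = (0.0, 0.0), (5.0, 0.0)
    candidates = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
    ddc = DistanceConstraint(k=1.0, step=1.0)
    prefer_near_goal = lambda g: -euclidean(g, goal)  # placeholder preference score

    print(pick_subgoal(state, candidates, ddc, prefer_near_goal))
    ddc.update(success_rate=0.9)  # low-level policy is doing well: allow farther subgoals
    print(pick_subgoal(state, candidates, ddc, prefer_near_goal))
```

In this sketch the high-level choice is always the highest-scoring subgoal inside the current radius, so raising the threshold after a run of successes moves the selected subgoal closer to the final goal. This reproduces the easy-to-hard progression described above only in a toy setting.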