Formulating expert policies as macro actions promises to alleviate the long-horizon issue via structured exploration and efficient credit assignment. However, traditional option-based multi-policy transfer methods suffer from inefficient exploration of macro-action length and insufficient exploitation of useful long-duration macro actions. In this paper, a novel algorithm named EASpace (Enhanced Action Space) is proposed, which formulates macro actions in an alternative form to accelerate the learning process using multiple available sub-optimal expert policies. Specifically, EASpace formulates each expert policy into multiple macro actions with different execution times. All the macro actions are then integrated directly into the primitive action space. An intrinsic reward, proportional to the execution time of a macro action, is introduced to encourage the exploitation of useful macro actions. A corresponding learning rule, similar to Intra-option Q-learning, is employed to improve data efficiency. Theoretical analysis is presented to show the convergence of the proposed learning rule. The efficiency of EASpace is illustrated on a grid-based game and a multi-agent pursuit problem. The proposed algorithm is also implemented in physical systems to validate its effectiveness.
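The action-space construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `MacroAction`, `build_enhanced_action_space`, and `intrinsic_reward`, the candidate durations, and the coefficient `coeff` are all hypothetical. The abstract states only that the intrinsic reward is proportional to a macro action's execution time, so the sketch does not model how or when that bonus is granted during training.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

PrimitiveAction = int  # assumption: primitive actions are indexed 0..n-1


@dataclass(frozen=True)
class MacroAction:
    """Follow one expert policy for a fixed number of steps (hypothetical type)."""
    expert_id: int  # index of the expert policy to follow
    duration: int   # execution time, in environment steps


Action = Union[PrimitiveAction, MacroAction]


def build_enhanced_action_space(
    n_primitive: int,
    n_experts: int,
    durations: Sequence[int],
) -> List[Action]:
    """Append one macro action per (expert, duration) pair to the primitive actions."""
    actions: List[Action] = list(range(n_primitive))
    for expert_id in range(n_experts):
        for duration in durations:
            actions.append(MacroAction(expert_id, duration))
    return actions


def intrinsic_reward(action: Action, coeff: float = 0.01) -> float:
    """Bonus proportional to execution time; primitive actions receive none.

    `coeff` is an assumed hyperparameter, not a value taken from the paper.
    """
    if isinstance(action, MacroAction):
        return coeff * action.duration
    return 0.0


# Example: 4 primitive actions, 2 experts, 3 candidate durations each.
space = build_enhanced_action_space(n_primitive=4, n_experts=2, durations=(2, 4, 8))
```

With these assumed settings the agent chooses among 4 + 2 × 3 = 10 actions through a single value function, which is what integrating macro actions directly into the primitive action space amounts to in this sketch.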