Exploration in sparse-reward reinforcement learning is difficult because obtaining any reward requires long, coordinated sequences of actions. Moreover, continuous action spaces contain infinitely many possible actions, which further increases the difficulty of exploration. One class of methods designed to address these issues forms temporally extended actions, often called skills, from interaction data collected in the same domain, and optimizes a policy on top of this new action space. Such methods typically require a lengthy pretraining phase, especially in continuous action spaces, to form the skills before reinforcement learning can begin. Given prior evidence that the full range of the continuous action space is not required in such tasks, we propose a novel approach to skill generation with two components. First, we discretize the action space through clustering; second, we leverage a tokenization technique borrowed from natural language processing to generate temporally extended actions. The resulting method outperforms skill-generation baselines in several challenging sparse-reward domains and requires orders of magnitude less computation for skill generation and online rollouts.
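The two components described above (clustering to discretize actions, then tokenization to build temporally extended actions) can be illustrated with a minimal sketch. The abstract names neither the clustering algorithm nor the tokenization technique, so this example assumes plain k-means and byte-pair-encoding (BPE) style pair merging. The function names `kmeans`, `merge_pair`, `learn_skills`, and `decode_skill`, along with all parameter values, are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter

import numpy as np


def kmeans(actions, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering method).

    Returns (centroids, labels); each label is a discrete action id.
    """
    actions = np.asarray(actions, dtype=float)
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centroids leaves `actions` untouched.
    centroids = actions[rng.choice(len(actions), size=k, replace=False)]

    def assign(cents):
        # Squared distance from every action to every centroid.
        d = ((actions[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centroids)
        for j in range(k):
            members = actions[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(centroids)


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_skills(sequences, num_primitives, num_merges):
    """BPE-style merging (an assumed choice of tokenization technique).

    Returns {token_id: tuple of primitive action ids}; tokens longer than
    one primitive play the role of temporally extended actions.
    """
    vocab = {i: (i,) for i in range(num_primitives)}
    seqs = [[int(t) for t in s] for s in sequences]
    for _ in range(num_merges):
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))  # count adjacent token pairs
        if not counts:
            break
        (a, b), _count = counts.most_common(1)[0]
        new_id = len(vocab)
        vocab[new_id] = vocab[a] + vocab[b]
        seqs = [merge_pair(s, (a, b), new_id) for s in seqs]
    return vocab


def decode_skill(vocab, centroids, token):
    """Expand a token into its sequence of continuous actions (centroids)."""
    return centroids[list(vocab[token])]


if __name__ == "__main__":
    # Synthetic stand-in for interaction data: 4 trajectories, 50 steps, 2-D actions.
    rng = np.random.default_rng(0)
    trajectories = rng.uniform(-1.0, 1.0, size=(4, 50, 2))
    centroids, labels = kmeans(trajectories.reshape(-1, 2), k=8)
    vocab = learn_skills(labels.reshape(4, 50), num_primitives=8, num_merges=10)
    longest = max(vocab, key=lambda t: len(vocab[t]))
    print(decode_skill(vocab, centroids, longest))
```

In this sketch, a downstream policy would choose among the token ids in `vocab`, and each chosen token would be executed open-loop as the centroid sequence returned by `decode_skill`. Both steps involve only distance computations and pair counting, which is one plausible reading of why skill generation of this kind can be cheap compared with a learned pretraining phase.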