Recent work in hierarchical reinforcement learning has shown success in scaling to billions of timesteps when learning over a set of predefined option reward functions. We show that, instead of assigning a single reward function to each option, these reward functions can be used to induce a space of behaviours: by letting the controller specify linear combinations over them, a more expressive set of policies can be represented. We call this method Hierarchical Behaviour Spaces (HBS). We evaluate HBS on the NetHack Learning Environment, demonstrating strong performance. We conduct a series of experiments and determine that, perhaps going against conventional wisdom, the benefits of hierarchy in our method come from increased exploration rather than long-term reasoning.
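The core mechanism can be sketched minimally: rather than the controller picking one option (and hence one reward function), it emits a weight vector whose linear combination of option rewards defines the behaviour the low-level policy is trained against. The reward functions and weights below are hypothetical illustrations, not those used in the paper.

```python
import numpy as np

# Hypothetical stand-ins for a set of predefined option reward functions.
# Each maps a (state, action) pair to a scalar reward.
reward_fns = [
    lambda s, a: float(s[0]),        # e.g. a progress/shaping term
    lambda s, a: -float(abs(a)),     # e.g. an action-cost term
    lambda s, a: float(s[1] > 0.5),  # e.g. a "reach region" bonus
]

def combined_reward(weights, state, action):
    """Linear combination of option rewards, as specified by the controller."""
    rewards = np.array([r(state, action) for r in reward_fns])
    return float(np.dot(weights, rewards))

# The controller outputs a weight vector instead of a single option index,
# so every weight setting induces a distinct behaviour in the space.
w = np.array([0.5, 0.2, 0.3])
print(combined_reward(w, state=(1.0, 0.7), action=-2))  # 0.5*1.0 + 0.2*(-2.0) + 0.3*1.0 = 0.4
```

Because the combination is linear, the behaviour space is continuous: interpolating between weight vectors interpolates between the behaviours the individual reward functions would induce on their own.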