Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of action called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there is no definitive answer as to which options one should consider. In this paper, we argue that the successor representation (SR), which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big-picture view of recent results, showing how the SR can be used to discover options that facilitate either temporally extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent's representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the SR allows us to augment a set of options into a combinatorially large counterpart without additional learning, by combining previously learned options. Our empirical evaluation focuses on options discovered for exploration and on the use of the SR to combine them. The results of our experiments shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the SR, such as eigenoptions and the option keyboard.
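To make the notion of "encoding states based on the pattern of state visitation that follows them" concrete, the sketch below computes a tabular SR for a toy problem. It is only an illustration under stated assumptions, not the procedure used in the paper: the six-state ring environment, the uniform random-walk policy, the discount factor, and the function names are all hypothetical. It relies on the standard closed form for a fixed policy with transition matrix P, namely SR = sum over t of gamma^t P^t = (I - gamma P)^(-1).

```python
import numpy as np


def random_walk_transitions(n_states: int) -> np.ndarray:
    """Transition matrix of a uniform random walk on a ring (hypothetical toy environment)."""
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        P[s, (s - 1) % n_states] = 0.5  # step to the left neighbour
        P[s, (s + 1) % n_states] = 0.5  # step to the right neighbour
    return P


def successor_representation(P: np.ndarray, gamma: float) -> np.ndarray:
    """Closed-form tabular SR for a fixed policy: (I - gamma * P)^(-1)."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)


gamma = 0.9  # assumed discount factor
P = random_walk_transitions(6)
sr = successor_representation(P, gamma)

# Cross-check against the defining series: sum_t gamma^t P^t, truncated at 500 terms.
series = np.zeros_like(P)
term = np.eye(P.shape[0])
for _ in range(500):
    series += term
    term = gamma * (term @ P)

# Row s of the SR holds the expected discounted number of future visits
# to every state when starting from state s.
print(np.round(sr[0], 3))
```

Each row of `sr` is the representation of one state. Nearby states on the ring receive more discounted visitation mass than distant ones, which is the kind of structure the SR-based option discovery methods discussed in the paper build on.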