Reward-Respecting Subtasks for Model-Based Reinforcement Learning

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward of the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest-path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.
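As a rough illustration of the idea of "original reward plus a bonus at termination", the sketch below computes the return of one trajectory under such a subtask. It is a minimal sketch and not the paper's formulation: the function name `subtask_return`, the scalar `stop_feature`, the `bonus_weight` parameter, and the discounting convention are all assumptions made for illustration.

```python
from typing import Sequence


def subtask_return(
    rewards: Sequence[float],
    stop_feature: float,
    bonus_weight: float,
    gamma: float = 1.0,
) -> float:
    """Return of one trajectory under a hypothetical reward-respecting subtask.

    The return is the discounted sum of the ORIGINAL rewards collected until
    the option terminates, plus a bonus proportional to a feature of the
    state in which it terminates (assumed form, for illustration only).
    """
    total = 0.0
    discount = 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    # Terminal bonus, discounted to the time of termination.
    return total + discount * bonus_weight * stop_feature


# Two hypothetical ways of reaching the same feature-active state:
short_but_costly = subtask_return([-10.0], stop_feature=1.0, bonus_weight=5.0)
long_but_cheap = subtask_return([-1.0, -1.0, -1.0], stop_feature=1.0, bonus_weight=5.0)
```

Under these made-up numbers, a subtask that ignored the original reward would favor the one-step path, whereas the reward-respecting return ranks the three-step path higher because the rewards incurred along the way still count.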
