Conventional reinforcement learning (RL) methods can successfully solve a wide range of sequential decision problems. However, learning policies that can generalize predictably across multiple tasks in a setting with non-Markovian reward specifications is a challenging problem. We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. For any task that is described by a finite state automaton (FSA) and involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning. In contrast to other methods that combine (sub)policies via planning, our method asymptotically attains global optimality, even in stochastic environments.
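To give a concrete sense of how a policy basis built from successor features can be reused without further learning, the following is a minimal sketch of generalized policy improvement, a standard technique from the successor-features literature. It is not the FSA-level procedure proposed here; the function names, the toy feature values, and the weight vectors are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

# psi[i][a] holds the successor features of (sub)policy i for action a
# at the current state. All values below are made up for illustration.
SuccessorFeatures = List[List[Sequence[float]]]


def q_values(psi: SuccessorFeatures, w: Sequence[float]) -> List[List[float]]:
    """Q_i(s, a) = psi_i(s, a) . w for every (sub)policy i and action a."""
    return [
        [sum(f * wi for f, wi in zip(feat, w)) for feat in per_action]
        for per_action in psi
    ]


def gpi_action(psi: SuccessorFeatures, w: Sequence[float]) -> int:
    """Pick argmax_a max_i Q_i(s, a): act greedily over the whole basis."""
    q = q_values(psi, w)
    n_actions = len(q[0])
    best_per_action = [max(q_i[a] for q_i in q) for a in range(n_actions)]
    return max(range(n_actions), key=best_per_action.__getitem__)


# Toy basis: two (sub)policies, two actions, two-dimensional features.
psi = [
    [[1.0, 0.0], [0.2, 0.1]],  # (sub)policy 0
    [[0.0, 0.3], [0.1, 0.9]],  # (sub)policy 1
]

# Two hypothetical tasks that differ only in their reward weights.
w_task_a = [1.0, 0.0]
w_task_b = [0.0, 1.0]

action_a = gpi_action(psi, w_task_a)
action_b = gpi_action(psi, w_task_b)
```

In this sketch, switching tasks only changes the weight vector `w`; the successor features themselves are reused as-is, which is the sense in which no additional learning is required.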