Conventional reinforcement learning (RL) methods can successfully solve a wide range of sequential decision problems. However, learning policies that can generalize predictably across multiple tasks in a setting with non-Markovian reward specifications is a challenging problem. We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning. In contrast to other methods that combine (sub)policies via planning, our method asymptotically attains global optimality, even in stochastic environments.
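To make the composition concrete, the following is a minimal Python sketch of the kind of action selection the abstract describes: each basis (sub)policy exposes its successor features, each FSA state carries a feature-weight vector (e.g., obtained by planning over the automaton), and the agent acts greedily over the best basis policy, in the style of generalized policy improvement (GPI). The names `PolicyBasisAgent`, `psi`, and `fsa_weights` are illustrative assumptions, not the paper's API.

```python
import numpy as np

class PolicyBasisAgent:
    """Sketch: select actions by evaluating every basis policy's
    successor features against the reward weights of the current
    FSA state (generalized policy improvement)."""

    def __init__(self, psi, fsa_weights):
        # psi: list of functions, psi[i](state) -> array of shape
        #      (n_actions, n_features), the successor features of
        #      basis policy i.
        self.psi = psi
        # fsa_weights: dict mapping each FSA state u to a weight
        # vector w_u of shape (n_features,), e.g. computed by
        # planning over the automaton (assumed given here).
        self.fsa_weights = fsa_weights

    def act(self, state, fsa_state):
        w = self.fsa_weights[fsa_state]
        # Q-values of each basis policy for the current subtask:
        # Q_i(s, a) = psi_i(s, a) . w_u
        q = np.stack([p(state) @ w for p in self.psi])  # (n_policies, n_actions)
        # GPI: act greedily w.r.t. the best basis policy at this state.
        return int(q.max(axis=0).argmax())
```

Because the successor features are learned once per subproblem, only the weight vectors change across FSA states, which is what lets a new automaton-specified task be solved without additional policy learning.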