To solve a task with reinforcement learning (RL), it is necessary to formally specify the goal of that task. Although most RL algorithms require that the goal is formalised as a Markovian reward function, alternatives have been developed (such as Linear Temporal Logic and Multi-Objective Reinforcement Learning). Moreover, it is well known that some of these formalisms are able to express certain tasks that other formalisms cannot express. However, there has not yet been any thorough analysis of how these formalisms relate to each other in terms of expressivity. In this work, we fill this gap in the existing literature by providing a comprehensive comparison of the expressivities of 17 objective-specification formalisms in RL. We place these formalisms in a preorder based on their expressive power, and present this preorder as a Hasse diagram. We find a variety of limitations for the different formalisms, and that no formalism is both dominantly expressive and straightforward to optimise with current techniques. For example, we prove that each of Regularised RL, Outer Nonlinear Markov Rewards, Reward Machines, Linear Temporal Logic, and Limit Average Rewards can express an objective that the others cannot. Our findings have implications for both policy optimisation and reward learning. Firstly, we identify expressivity limitations which are important to consider when specifying objectives in practice. Secondly, our results highlight the need for future research which adapts reward learning to work with a variety of formalisms, since many existing reward learning methods implicitly assume that desired objectives can be expressed with Markovian rewards. Our work contributes towards a more cohesive understanding of the costs and benefits of different RL objective-specification formalisms.