A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Indeed, reward functions in RL are typically treated as black-box mappings from state-action pairs to scalar values. While effective in many settings, this approach provides no information about why rewards are given, which can hinder learning and interpretability. Reward Machines address this issue by representing reward functions as finite state automata, enabling the specification of structured, non-Markovian reward functions. However, their expressivity is typically bounded by regular languages, leaving them unable to capture more complex behaviours such as counting or parametrised conditions. In this work, we build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines. By leveraging the built-in memory of RML, our approach can specify reward functions for non-regular, non-Markovian tasks. We demonstrate the expressiveness of our approach through experiments, highlighting additional advantages in flexible event-handling and task specification over existing Reward Machine-based methods.
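The contrast between automaton-based and memory-based reward specifications can be illustrated with a minimal sketch. The code below is illustrative only and does not use the RML syntax from the paper: it contrasts a plain finite-state Reward Machine (limited to regular patterns) with a hypothetical counter-augmented machine that rewards the non-regular pattern a^n b^n, which no finite automaton can express. All class and event names are assumptions for the example.

```python
class RewardMachine:
    """Minimal FSA-style reward machine (illustrative, not the paper's API):
    transitions map (state, event) -> (next_state, reward); unknown events
    leave the state unchanged and yield zero reward."""

    def __init__(self, transitions, initial):
        self.transitions = transitions
        self.state = initial

    def step(self, event):
        self.state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))
        return reward


# Regular task: "pick up key k, then reach goal g" -- expressible as an FSA.
rm = RewardMachine({("u0", "k"): ("u1", 0.0),
                    ("u1", "g"): ("u2", 1.0)}, "u0")
# Reaching g before k yields nothing; only the ordered sequence is rewarded.
rewards = [rm.step(e) for e in ["g", "k", "g"]]  # [0.0, 0.0, 1.0]


class CountingRewardMachine:
    """Counter-augmented machine (sketch of what built-in memory enables):
    rewards completing a^n b^n, i.e. as many 'b' events as preceding 'a'
    events -- a non-regular pattern beyond any finite-state machine."""

    def __init__(self):
        self.count = 0

    def step(self, event):
        if event == "a":
            self.count += 1
            return 0.0
        if event == "b" and self.count > 0:
            self.count -= 1
            return 1.0 if self.count == 0 else 0.0
        return 0.0
```

The FSA version needs one explicit state per tracked condition, so unbounded counting would require infinitely many states; the counter variable plays the role that RML's built-in memory plays in the proposed language-based Reward Machines.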