Reward Machines (RMs) are an established mechanism in Reinforcement Learning (RL) for representing and learning sparse, temporally extended tasks with non-Markovian rewards. RMs rely on high-level information in the form of labels emitted by the environment alongside each observation. This requires manual user input for every environment and task: the user must craft a suitable labeling function that computes the labels. These limitations make RMs hard to apply in widely adopted RL frameworks. We propose Symbolic Reward Machines (SRMs), together with the learning algorithms QSRM and LSRM, to overcome these limitations. SRMs consume only the standard output of the environment and process the observation directly through guards represented by symbolic formulas. In our evaluation, the SRM methods outperform baseline RL approaches and match the results of existing RM methods, while adhering to the widely used environment definition and providing the user with interpretable representations of the task.
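To make the guard idea concrete, here is a minimal sketch of a reward machine whose transitions fire on symbolic predicates evaluated directly on the raw observation, rather than on externally supplied labels. All names (`SymbolicRewardMachine`, `SRMTransition`, the example guards) are hypothetical illustrations of the concept, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

import numpy as np

# A guard is a symbolic predicate over the raw observation vector,
# e.g. "obs[0] > 0.9" -- no separate labeling function is needed.
Guard = Callable[[np.ndarray], bool]

@dataclass
class SRMTransition:
    guard: Guard      # symbolic formula over the observation
    next_state: int   # machine state reached when the guard holds
    reward: float     # reward emitted on this transition

class SymbolicRewardMachine:
    """Finite-state machine whose transitions fire on symbolic guards
    evaluated on raw observations instead of environment labels."""

    def __init__(self, transitions: Dict[int, List[SRMTransition]],
                 initial_state: int = 0):
        self.transitions = transitions
        self.state = initial_state

    def step(self, obs: np.ndarray) -> float:
        """Advance the machine on one observation; return its reward."""
        for t in self.transitions.get(self.state, []):
            if t.guard(obs):
                self.state = t.next_state
                return t.reward
        return 0.0  # no guard fired: remain in the current state

# Usage: a two-phase task -- first reach x > 0.9, then return to x < 0.1.
srm = SymbolicRewardMachine({
    0: [SRMTransition(lambda o: o[0] > 0.9, next_state=1, reward=0.0)],
    1: [SRMTransition(lambda o: o[0] < 0.1, next_state=2, reward=1.0)],
})
print(srm.step(np.array([0.95])))  # guard fires, move to state 1
print(srm.step(np.array([0.05])))  # guard fires, move to state 2, reward 1.0
```

Because the guards are symbolic formulas, the machine's internal state and transition conditions remain human-readable, which is the interpretability property the abstract refers to.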