Reward machines (RMs) are an effective approach for addressing non-Markovian rewards in reinforcement learning (RL) through finite-state machines. Traditional RMs, which label edges with propositional logic formulae, inherit the limited expressivity of propositional logic. This limitation hinders the learnability and transferability of RMs, since complex tasks require RMs with numerous states and edges. To overcome these challenges, we propose First-Order Reward Machines ($\texttt{FORM}$s), which use first-order logic to label edges, resulting in more compact and transferable RMs. We introduce a novel method for $\textbf{learning}$ $\texttt{FORM}$s and a multi-agent formulation for $\textbf{exploiting}$ them that facilitates their transferability, where multiple agents collaboratively learn policies for a shared $\texttt{FORM}$. Our experimental results demonstrate the scalability of $\texttt{FORM}$s with respect to traditional RMs. Specifically, we show that $\texttt{FORM}$s can be effectively learnt for tasks where traditional RM learning approaches fail. We also show significant improvements in learning speed and task transferability thanks to the multi-agent learning framework and the abstraction provided by the first-order language.