Reward machines have shown great promise at capturing non-Markovian reward functions for learning tasks that involve complex action sequencing. However, no algorithm currently exists for learning reward machines from realistic weak feedback in the form of preferences. We contribute REMAP, a novel algorithm for learning reward machines from preferences, with correctness and termination guarantees. REMAP replaces the membership queries of the L* algorithm with preference queries, and leverages a symbolic observation table along with unification and constraint solving to narrow the search space of hypothesis reward machines. In addition to the proofs of correctness and termination for REMAP, we present empirical evidence measuring correctness: how frequently the learned reward machine is isomorphic to the ground truth under a consistent yet inexact teacher, and the regret between the ground-truth and learned reward machines.
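
To make the ideas of a preference query and a symbolic observation table concrete, the following is a minimal Python sketch and not the REMAP implementation. It assumes a toy reward machine modelled as a finite-state transducer, and a teacher that compares two input sequences by the reward emitted on their final transition. The names `RewardMachine`, `preference_query`, and `collect_constraints` are hypothetical. Sequences the teacher reports as equally preferred are merged into one symbolic class with a small union-find (a stand-in for unification), and the remaining strict preferences are kept as ordering constraints that a constraint solver would have to satisfy.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Seq = Tuple[str, ...]


@dataclass
class RewardMachine:
    """Toy reward machine: a finite-state transducer emitting one reward per symbol."""
    initial: Hashable
    # (state, symbol) -> (next_state, reward)
    delta: Dict[Tuple[Hashable, str], Tuple[Hashable, float]]

    def last_reward(self, seq: Sequence[str]) -> float:
        # Assumption: a sequence is scored by the reward of its final transition;
        # the empty sequence scores 0.0.
        state, reward = self.initial, 0.0
        for sym in seq:
            state, reward = self.delta[(state, sym)]
        return reward


def preference_query(teacher: RewardMachine, s1: Seq, s2: Seq) -> int:
    """Return 1 if s1 is preferred, -1 if s2 is preferred, 0 if equally preferred."""
    r1, r2 = teacher.last_reward(s1), teacher.last_reward(s2)
    return (r1 > r2) - (r1 < r2)


def collect_constraints(teacher: RewardMachine, sequences: List[Seq]):
    """Compare all pairs; merge equal sequences and record strict orderings."""
    parent = {s: s for s in sequences}

    def find(s: Seq) -> Seq:
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    greater: Set[Tuple[Seq, Seq]] = set()
    for i, a in enumerate(sequences):
        for b in sequences[i + 1:]:
            cmp = preference_query(teacher, a, b)
            if cmp == 0:
                parent[find(a)] = find(b)  # same symbolic value
            elif cmp > 0:
                greater.add((a, b))
            else:
                greater.add((b, a))

    classes = {s: find(s) for s in sequences}
    # Express each strict preference between class representatives.
    strict = {(find(a), find(b)) for a, b in greater}
    return classes, strict


# Hypothetical task: reward 1 is emitted when "b" is observed right after "a".
teacher = RewardMachine(
    initial="q0",
    delta={
        ("q0", "a"): ("q1", 0.0),
        ("q0", "b"): ("q0", 0.0),
        ("q1", "a"): ("q1", 0.0),
        ("q1", "b"): ("q0", 1.0),
    },
)
sequences = [(), ("a",), ("b",), ("a", "b"), ("b", "a")]
classes, strict = collect_constraints(teacher, sequences)
```

In this toy run only `("a", "b")` earns a reward, so the five sequences collapse into two symbolic classes linked by a single strict constraint. The teacher never reveals a numeric reward, which is the sense in which preference feedback is weaker than the membership queries of L*.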