This paper presents PROB-IRM, an approach that learns robust reward machines (RMs) for reinforcement learning (RL) agents from noisy execution traces. The key aspect of RM-driven RL is the exploitation of a finite-state machine that decomposes the agent's task into subtasks. PROB-IRM learns RMs from noisy traces using a state-of-the-art inductive logic programming framework that is robust to noisy examples, relying on Bayesian posterior degrees of belief to ensure robustness against inconsistencies. Pivotal to the results is the interleaving of RM learning and policy learning: a new RM is learned whenever the RL agent generates a trace that is believed not to be accepted by the current RM. To speed up the training of the RL agent, PROB-IRM employs a probabilistic formulation of reward shaping that uses the posterior Bayesian beliefs derived from the traces. Our experimental analysis shows that PROB-IRM can learn (potentially imperfect) RMs from noisy traces and exploit them to train an RL agent to solve its tasks successfully. Despite the complexity of learning RMs from noisy traces, agents trained with PROB-IRM perform comparably to agents provided with handcrafted RMs.
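To make the RM concepts above concrete, the following is a minimal, hypothetical sketch of a reward machine as a finite-state machine whose transitions fire on high-level propositional labels and emit rewards, and whose trace-acceptance check is the kind of test that triggers re-learning when a trace is believed not to be accepted. All names and the example task are illustrative assumptions, not PROB-IRM's actual implementation or API.

```python
# Illustrative reward machine (RM) sketch: a finite-state machine that
# decomposes a task into subtasks. Transitions fire on propositional
# labels and emit rewards. Hypothetical structure, not PROB-IRM's code.

class RewardMachine:
    def __init__(self, transitions, initial_state, accepting_states):
        # transitions: {(state, label): (next_state, reward)}
        self.transitions = transitions
        self.initial_state = initial_state
        self.accepting_states = accepting_states

    def step(self, state, label):
        # Labels with no matching transition leave the RM state
        # unchanged and yield zero reward.
        return self.transitions.get((state, label), (state, 0.0))

    def accepts(self, trace):
        # A trace (sequence of labels) is accepted if it drives the RM
        # from the initial state to an accepting state. In PROB-IRM-style
        # interleaved learning, a trace believed NOT to be accepted by
        # the current RM triggers learning a new RM.
        state = self.initial_state
        for label in trace:
            state, _ = self.step(state, label)
        return state in self.accepting_states


# Hypothetical task: "get coffee, then deliver it to the office",
# decomposed into two subtasks by the RM states u0 -> u1 -> u2.
rm = RewardMachine(
    transitions={
        ("u0", "coffee"): ("u1", 0.0),  # subtask 1: pick up coffee
        ("u1", "office"): ("u2", 1.0),  # subtask 2: deliver to office
    },
    initial_state="u0",
    accepting_states={"u2"},
)
print(rm.accepts(["coffee", "office"]))  # True: subtasks in order
print(rm.accepts(["office", "coffee"]))  # False: wrong order
```

In the probabilistic setting the paper describes, the labels themselves would carry Bayesian posterior degrees of belief rather than being observed exactly; the deterministic acceptance check above is the noise-free special case.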