Reward machines (RMs) inform reinforcement learning agents about the reward structure of the environment, enabling support for non-Markovian tasks and improving sample efficiency. However, learning with RMs is ill-suited for long-horizon problems where subtasks can be completed in any order. In such cases, the amount of information to learn increases exponentially with the number of unordered subtasks. We address this issue by introducing three generalisations of RMs: (1) Numeric RMs allow users to express complex tasks in a compact form. (2) In agenda RMs, states are associated with an agenda that tracks the remaining subtasks to complete. (3) Coupled RMs have coupled states associated with each subtask in the agenda. In addition, we introduce QCoRM, a new task-decomposition Q-learning-based algorithm that leverages coupled RMs and preserves global optimality guarantees in tabular settings. Our experiments across four domains -- featuring both discrete and continuous action and state spaces -- demonstrate that QCoRM scales better than baseline algorithms for long-horizon problems with unordered subtasks.
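To make the exponential growth concrete, the following minimal sketch counts the states a flat RM would need if it must remember which of n unordered subtasks are already done. It assumes one RM state per subset of completed subtasks; the function name and subtask labels are hypothetical and are not taken from the paper, and the sketch does not implement numeric, agenda, or coupled RMs.

```python
from itertools import combinations


def flat_rm_states(subtasks):
    """Enumerate one state per subset of completed subtasks.

    Assumption (for illustration only): a flat RM distinguishes states
    solely by which subtasks have been completed so far.
    """
    states = []
    for k in range(len(subtasks) + 1):
        for done in combinations(subtasks, k):
            states.append(frozenset(done))
    return states


if __name__ == "__main__":
    # Hypothetical subtask labels; the state count doubles with each one.
    for n in range(1, 6):
        labels = [f"task{i}" for i in range(n)]
        print(n, len(flat_rm_states(labels)))
```

Under this assumption the state count is 2^n, so each additional unordered subtask doubles the number of states the agent has to learn about.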