Despite advances in reinforcement learning, many sequential decision-making tasks remain prohibitively expensive and impractical to learn. Recently, approaches that automatically generate reward functions from logical task specifications have been proposed to mitigate this issue; however, they scale poorly on long-horizon tasks (i.e., tasks where the agent must perform a series of correct actions to reach the goal state, accounting for future transitions when choosing each action). Employing a curriculum (a sequence of increasingly complex tasks) further improves the agent's learning speed by sequencing intermediate tasks suited to its learning capacity. However, generating curricula from a logical specification remains an unsolved problem. To this end, we propose AGCL, Automaton-guided Curriculum Learning, a novel method that automatically generates curricula for the target task in the form of directed acyclic graphs (DAGs). AGCL encodes the specification as a deterministic finite automaton (DFA), then uses the DFA together with the Object-Oriented MDP (OOMDP) representation to generate a curriculum as a DAG, where vertices correspond to tasks and edges correspond to the direction of knowledge transfer. Experiments in gridworld and physics-based simulated robotics domains show that the curricula produced by AGCL achieve improved time-to-threshold performance on a complex sequential decision-making problem relative to state-of-the-art curriculum learning baselines (e.g., teacher-student, self-play) and automaton-guided reinforcement learning baselines (e.g., Q-Learning for Reward Machines). Further, we demonstrate that AGCL performs well even when the task's OOMDP description is noisy, and when distractor objects are present that are not modeled in the logical specification of the task's objectives.
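To make the idea of a curriculum DAG derived from a DFA more concrete, the following is a minimal toy sketch, not the AGCL algorithm itself. It assumes a hypothetical DFA for "collect wood and stone in either order, then craft", treats "drive the DFA from the start state to state q" as one task per non-start state, and draws a knowledge-transfer edge along each DFA transition that moves farther from the start state. The state names, symbols, and helper functions (`bfs_depths`, `curriculum_dag`, `training_order`) are all assumptions made for illustration, and the OOMDP representation used by AGCL is not modeled here.

```python
from collections import deque

# Hypothetical toy DFA (not from the paper): transitions[state][symbol] -> next state.
TOY_DFA = {
    "q0": {"wood": "q1", "stone": "q2"},
    "q1": {"stone": "q3"},
    "q2": {"wood": "q3"},
    "q3": {"craft": "q4"},
    "q4": {},
}


def bfs_depths(transitions, start):
    """Return the shortest distance (in transitions) from start to each reachable state."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, {}).values():
            if nxt not in depth:
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return depth


def curriculum_dag(transitions, start):
    """Build a toy curriculum DAG from a DFA.

    Vertices: one task per reachable non-start state ("reach this DFA state").
    Edges: (src, dst) for each DFA transition that strictly increases the
    distance from the start state, so self-loops and back-edges are dropped
    and the result is acyclic.
    """
    depth = bfs_depths(transitions, start)
    tasks = sorted(s for s in depth if s != start)
    edges = set()
    for src in tasks:
        for dst in transitions.get(src, {}).values():
            if depth[dst] > depth[src]:
                edges.add((src, dst))
    return tasks, sorted(edges)


def training_order(tasks, edges):
    """Kahn's algorithm: order tasks so every source task precedes its targets."""
    indegree = {t: 0 for t in tasks}
    for _, dst in edges:
        indegree[dst] += 1
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for src, dst in edges:
            if src == task:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return order


if __name__ == "__main__":
    tasks, edges = curriculum_dag(TOY_DFA, "q0")
    print("tasks:", tasks)
    print("transfer edges:", edges)
    print("one valid training order:", training_order(tasks, edges))
```

In this sketch the two single-resource tasks (`q1`, `q2`) have no prerequisites, both feed into the combined task `q3`, and `q3` feeds into the final task `q4`, so an agent could be trained on the simpler tasks first and transfer what it learned along the edges.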