A central challenge in multi-task reinforcement learning (RL) is to train generalist policies capable of performing tasks not seen during training. To facilitate such generalization, linear temporal logic (LTL) has recently emerged as a powerful formalism for specifying structured, temporally extended tasks to RL agents. While existing approaches to LTL-guided multi-task RL generalize successfully across LTL specifications, they cannot generalize to unseen vocabularies of propositions (or "symbols"), the atomic units of LTL that describe high-level environment events. We present PlatoLTL, a novel approach that enables policies to generalize zero-shot not only compositionally across LTL formula structures, but also parametrically across propositions. We achieve this by treating propositions as instances of parameterized predicates rather than as discrete symbols, allowing policies to learn shared structure across related propositions. We propose an architecture that embeds and composes predicates to represent LTL specifications, and we demonstrate successful zero-shot generalization to novel propositions and tasks across challenging environments.
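To make the parameterized-predicate idea concrete, the following minimal sketch contrasts the two proposition representations the abstract describes. It is not the authors' implementation: the table names (`symbol_table`, `predicate_table`, `param_table`), the embedding dimension, and the concatenation-based composition are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the PlatoLTL architecture) of propositions as
# parameterized predicates versus discrete symbols.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8

# Discrete-symbol view: each proposition is an opaque token with its own
# embedding, so "reach_red" and "reach_blue" share no learned structure and
# an unseen proposition has no embedding at all.
symbol_table = {
    "reach_red":  rng.normal(size=EMB_DIM),
    "reach_blue": rng.normal(size=EMB_DIM),
}

# Parameterized-predicate view: a proposition is a predicate identity plus a
# parameter, so related propositions share the predicate vector and a novel
# parameter can still be embedded and composed at test time.
predicate_table = {
    "reach": rng.normal(size=EMB_DIM),
    "avoid": rng.normal(size=EMB_DIM),
}
param_table = {
    "red":   rng.normal(size=EMB_DIM),
    "blue":  rng.normal(size=EMB_DIM),
    "green": rng.normal(size=EMB_DIM),  # assume "green" never appeared in training tasks
}

def embed_proposition(predicate: str, param: str) -> np.ndarray:
    """Compose a proposition embedding from its predicate and parameter parts."""
    return np.concatenate([predicate_table[predicate], param_table[param]])

# "reach(green)" is representable even though no training task mentioned it,
# which is the kind of zero-shot generalization across propositions that the
# abstract claims.
novel = embed_proposition("reach", "green")
print(novel.shape)  # (16,)
```

In this toy version the composition is a simple concatenation; any learned composition (e.g. an MLP over the two parts) would serve the same role of exposing shared structure across related propositions.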