A core ambition of reinforcement learning (RL) is the creation of agents capable of rapid learning in novel tasks. Meta-RL aims to achieve this by directly learning such agents. One category of meta-RL methods, called black box methods, does so by training off-the-shelf sequence models end-to-end. In contrast, another category of methods explicitly infers a posterior distribution over the unknown task. These methods generally have distinct objectives and sequence models designed to enable task inference, and so are known as task inference methods. However, recent evidence suggests that task inference objectives are unnecessary in practice. Nonetheless, it remains unclear whether task inference sequence models are beneficial even when task inference objectives are not. In this paper, we present strong evidence that task inference sequence models are still beneficial. In particular, we investigate sequence models with permutation invariant aggregation, which exploit the fact that, due to the Markov property, the task posterior does not depend on the order in which data is observed. We empirically confirm the advantage of permutation invariant sequence models even when no task inference objective is used. However, we also find, surprisingly, that there are multiple conditions under which permutation variance remains useful. Therefore, we propose SplAgger, which uses both permutation variant and permutation invariant components to achieve the best of both worlds, outperforming all baselines on continuous control and memory environments.
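
To make the distinction between permutation invariant and permutation variant aggregation concrete, the following is a minimal NumPy sketch. It is not the SplAgger architecture: the dimensions, the random weights standing in for learned parameters, the choice of max-pooling, and the function names `invariant_aggregate` and `variant_aggregate` are all assumptions made for illustration. The final concatenation is only a naive way of exposing both kinds of features and should not be read as the method proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T transitions, each a D-dim feature vector, H hidden units.
T, D, H = 5, 8, 4
transitions = rng.normal(size=(T, D))

# Random weights standing in for learned parameters (assumption for illustration).
W_enc = rng.normal(size=(D, H)) / np.sqrt(D)
W_rec = rng.normal(size=(H, H)) / np.sqrt(H)


def invariant_aggregate(x):
    """Encode each transition independently, then max-pool over the time axis.

    Max-pooling ignores the order of the rows, so the output is
    permutation invariant.
    """
    return np.tanh(x @ W_enc).max(axis=0)


def variant_aggregate(x):
    """Simple recurrent update; the final state depends on the input order."""
    h = np.zeros(H)
    for x_t in x:
        h = np.tanh(x_t @ W_enc + h @ W_rec)
    return h


# Present the same transitions in reversed order.
reordered = transitions[::-1]

inv_same = np.allclose(invariant_aggregate(transitions),
                       invariant_aggregate(reordered))
var_same = np.allclose(variant_aggregate(transitions),
                       variant_aggregate(reordered))

# Naive illustration only: concatenate both summaries into one feature vector.
combined = np.concatenate([invariant_aggregate(transitions),
                           variant_aggregate(transitions)])
```

Here `inv_same` checks that the pooled summary is unchanged when the transitions are reordered, while `var_same` checks the same property for the recurrent summary, which in general does not hold.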