As a marriage of offline RL and meta-RL, the advent of offline meta-reinforcement learning (OMRL) has shown great promise in enabling RL agents to multi-task and to adapt quickly while acquiring knowledge safely. Among these methods, context-based OMRL (COMRL), a popular paradigm, aims to learn a universal policy conditioned on effective task representations. In this work, by examining several key milestones in the field of COMRL, we propose to integrate these seemingly independent methodologies into a unified framework. Most importantly, we show that the pre-existing COMRL algorithms are essentially optimizing the same mutual information objective $I(Z; M)$ between the task variable $M$ and its latent representation $Z$ by implementing various approximate bounds. This theoretical insight offers ample design freedom for novel algorithms. As demonstrations, we propose a supervised and a self-supervised implementation of $I(Z; M)$, and empirically show that the corresponding optimization algorithms exhibit remarkable generalization across a broad spectrum of RL benchmarks, context shift scenarios, data qualities and deep learning architectures. This work lays the information-theoretic foundation for COMRL methods, leading to a better understanding of task representation learning in the context of reinforcement learning.
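To illustrate the kind of approximate bound involved, the following is a minimal sketch (not the specific derivation used in this work): the classical Barber--Agakov construction lower-bounds $I(Z; M)$ using any variational decoder $q_{\phi}(m \mid z)$, a hypothetical auxiliary model that predicts the task from the latent representation:
\begin{align*}
I(Z; M) &= \mathbb{E}_{p(m,z)}\!\left[\log \frac{p(m \mid z)}{p(m)}\right] \\
        &= \mathbb{E}_{p(m,z)}\!\left[\log q_{\phi}(m \mid z)\right] + H(M) + \mathbb{E}_{p(z)}\!\left[\mathrm{KL}\!\left(p(m \mid z)\,\Vert\, q_{\phi}(m \mid z)\right)\right] \\
        &\geq \mathbb{E}_{p(m,z)}\!\left[\log q_{\phi}(m \mid z)\right] + H(M),
\end{align*}
where the inequality follows from the non-negativity of the KL divergence and is tight when $q_{\phi}(m \mid z) = p(m \mid z)$. Since $H(M)$ is a constant of the task distribution, maximizing such a bound amounts to fitting the decoder by maximum likelihood; different choices of the variational family, or of a contrastive surrogate for the conditional, yield different practical objectives, which is the design freedom referenced above.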