Generalization and sample efficiency are long-standing challenges in reinforcement learning, and Offline Meta-Reinforcement Learning (OMRL) has therefore gained increasing attention for its potential to solve a wide range of problems with static and limited offline data. Existing OMRL methods often assume sufficient training tasks and data coverage so that contrastive learning can be applied to extract task representations. However, such assumptions do not hold in several real-world applications, which undermines the generalization ability of the learned representations. In this paper, we consider OMRL under two types of data limitations, limited training tasks and limited behavior diversity, and propose a novel algorithm called GENTLE for learning generalizable task representations in the face of these limitations. GENTLE employs a Task Auto-Encoder (TAE), an encoder-decoder architecture that extracts the characteristics of the tasks. Unlike existing methods, TAE is optimized solely by reconstructing the state transition and reward, which captures the generative structure of the task models and produces generalizable representations when training tasks are limited. To alleviate the effect of limited behavior diversity, we consistently construct pseudo-transitions to align the data distribution used to train TAE with the data distribution encountered during testing. Empirically, GENTLE significantly outperforms existing OMRL methods on both in-distribution and out-of-distribution tasks under both the given-context protocol and the one-shot protocol.
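To make the reconstruction-only objective concrete, the following is a minimal sketch and not the GENTLE implementation. It assumes linear encoder and decoder layers, mean-pooling of per-transition embeddings into a task vector `z`, and a squared-error loss; the toy dynamics, the dimensions, and every function name here are hypothetical choices made only for illustration.

```python
import random

def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases for one linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def encode(encoder, context):
    """Embed each (s, a, r, s') transition and mean-pool into a task vector z.
    Mean-pooling is an assumption of this sketch."""
    w, b = encoder
    embeddings = [linear(w, b, s + a + [r] + s_next)
                  for s, a, r, s_next in context]
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def reconstruction_loss(encoder, decoder, context, batch):
    """Mean squared error of predicting (s', r) from (s, a, z)."""
    z = encode(encoder, context)
    w, b = decoder
    total = 0.0
    for s, a, r, s_next in batch:
        pred = linear(w, b, s + a + z)   # predicted next state, then reward
        target = s_next + [r]
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(batch)

STATE_DIM, ACTION_DIM, LATENT_DIM = 2, 1, 3

def sample_transition(rng, goal):
    """Hypothetical toy task: the action shifts the first state coordinate,
    and the reward is the negative distance of that coordinate to the goal."""
    s = [rng.uniform(-1, 1) for _ in range(STATE_DIM)]
    a = [rng.uniform(-1, 1)]
    s_next = [s[0] + 0.1 * a[0], s[1]]
    r = -abs(s_next[0] - goal)
    return s, a, r, s_next

rng = random.Random(0)
transitions = [sample_transition(rng, goal=0.5) for _ in range(16)]
encoder = init_layer(LATENT_DIM, 2 * STATE_DIM + ACTION_DIM + 1, rng)
decoder = init_layer(STATE_DIM + 1, STATE_DIM + ACTION_DIM + LATENT_DIM, rng)

# The first half serves as context, the second half as reconstruction targets.
loss = reconstruction_loss(encoder, decoder, transitions[:8], transitions[8:])
print(loss)
```

In this sketch the only training signal is how well the decoder reconstructs the next state and reward given the task vector, so no contrastive term across tasks is involved. Minimizing `loss` with a gradient-based optimizer is omitted to keep the example short.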