Offline meta-reinforcement learning (OMRL) uses pre-collected offline datasets to improve an agent's ability to generalize to unseen tasks. However, a context shift problem arises from the distribution discrepancy between the contexts used for training (collected by the behavior policy) and those used for testing (collected by the exploration policy). This shift leads to incorrect task inference and, in turn, degrades the generalization ability of the meta-policy. Existing OMRL methods either overlook this problem or attempt to mitigate it with additional information. In this paper, we propose a novel approach called Context Shift Reduction for OMRL (CSRO) to address the context shift problem using only offline datasets. The key insight of CSRO is to minimize the influence of the policy on the context during both the meta-training and meta-test phases. During meta-training, we design a max-min mutual information representation learning mechanism to diminish the impact of the behavior policy on the task representation. In the meta-test phase, we introduce a non-prior context collection strategy to reduce the effect of the exploration policy. Experimental results demonstrate that CSRO significantly reduces the context shift and improves the generalization ability, surpassing previous methods across various challenging domains.
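As a purely illustrative sketch, and not the CSRO implementation, the toy example below shows one possible reading of collecting test-time context without relying on a prior task belief. Everything in it is an assumption made for illustration: the 1-D task with reward -|a - g|, the grid-search `infer_goal` that stands in for a learned context encoder, and the choice to take the first `n_random` actions uniformly at random before acting on the inferred task.

```python
import numpy as np

# Hypothetical toy setting: each task is a hidden goal g in [-1, 1],
# and the reward for action a is -|a - g|.
GOAL_GRID = np.linspace(-1.0, 1.0, 201)  # candidate goals for task inference


def reward(action, goal):
    return -abs(action - goal)


def infer_goal(context):
    """Stand-in for a learned context encoder: least-squares fit of the goal."""
    actions = np.array([a for a, _ in context])
    rewards = np.array([r for _, r in context])
    # Reward each candidate goal would have produced for each observed action.
    predicted = -np.abs(actions[None, :] - GOAL_GRID[:, None])
    errors = ((predicted - rewards[None, :]) ** 2).sum(axis=1)
    return float(GOAL_GRID[np.argmin(errors)])


def collect_context(goal, n_random=5, n_total=10, seed=0):
    """Collect (action, reward) pairs for one test task.

    The first n_random actions are uniform random, so the initial context
    does not depend on any prior guess about the task (an assumed reading
    of "non-prior"). Later actions use the task inferred from the context.
    """
    rng = np.random.default_rng(seed)
    context = []
    for t in range(n_total):
        if t < n_random:
            action = float(rng.uniform(-1.0, 1.0))
        else:
            action = infer_goal(context)
        context.append((action, reward(action, goal)))
    return context


if __name__ == "__main__":
    ctx = collect_context(goal=0.37)
    print("inferred goal:", infer_goal(ctx))
```

In this toy setting the early context is independent of any policy conditioned on a guessed task, which is the property the abstract attributes to the meta-test phase. How CSRO itself schedules exploration is not specified in the abstract.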