Recent offline meta-reinforcement learning (meta-RL) methods typically use task-dependent behavior policies (e.g., RL agents trained on each individual task) to collect a multi-task dataset. However, these methods always require extra information for fast adaptation, such as offline context for the test tasks. To address this problem, we first formally characterize a challenge unique to offline meta-RL: transition-reward distribution shift between offline datasets and online adaptation. Our theory shows that out-of-distribution adaptation episodes may lead to unreliable policy evaluation, whereas online adaptation with in-distribution episodes can provide an adaptation performance guarantee. Based on these theoretical insights, we propose a novel adaptation framework, In-Distribution online Adaptation with uncertainty Quantification (IDAQ), which generates in-distribution context using a given uncertainty quantification and performs effective task belief inference to address new tasks. We find that a return-based uncertainty quantification works well for IDAQ. Experiments show that IDAQ achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with and without offline adaptation.
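To make the idea of generating in-distribution context from a return-based uncertainty quantification more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an adaptation episode with a higher return is more likely to be in-distribution, since the offline data come from task-dependent behavior policies, and it keeps only episodes whose uncertainty falls below a threshold. The names `Episode`, `return_based_uncertainty`, `select_in_distribution`, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One online adaptation episode, reduced to its per-step rewards (hypothetical layout)."""
    rewards: List[float]

    @property
    def episode_return(self) -> float:
        # Undiscounted sum of rewards over the episode.
        return sum(self.rewards)


def return_based_uncertainty(episode: Episode) -> float:
    """Assumption for this sketch: lower return means higher uncertainty."""
    return -episode.episode_return


def select_in_distribution(episodes: List[Episode], threshold: float) -> List[Episode]:
    """Keep episodes whose uncertainty is at most `threshold` (a hypothetical hyperparameter)."""
    return [ep for ep in episodes if return_based_uncertainty(ep) <= threshold]


# Three adaptation episodes with returns 1.0, 5.0, and 10.0.
episodes = [Episode([0.5, 0.5]), Episode([2.0, 3.0]), Episode([4.0, 6.0])]

# Keep only episodes with return >= 4.0, i.e. uncertainty <= -4.0.
context = select_in_distribution(episodes, threshold=-4.0)
kept_returns = [ep.episode_return for ep in context]
```

In this sketch the filtered episodes would serve as the context for task belief inference; how the threshold is chosen and how the belief is updated are not specified in the abstract and are left open here.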