Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

Offline pretraining on a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to real-world RL deployment: in few real settings would one deploy an offline policy with no test runs or tuning. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but this unnecessarily limits policy performance when the behavior policy is far from optimal. Instead, we forgo policy constraints and frame OtO RL as an exploration problem: we must maximize the benefit of online data collection. We study major online RL exploration paradigms and adapt them to work well in the OtO setting. These adapted methods contribute several strong baselines. We also introduce an algorithm for planning to go out of distribution (PTGOOD), which targets online exploration in relatively high-reward regions of the state-action space that are unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy, so that the limited interaction budget is used effectively. We show that PTGOOD significantly improves agent returns during online fine-tuning and finds the optimal policy in as few as 10k online steps in Walker and in as few as 50k in complex control tasks like Humanoid. We also find that PTGOOD avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.
