Offline reinforcement learning (RL) offers a promising way to train agents in a fully data-driven paradigm. However, constrained by the limited quality of the offline dataset, the resulting performance is often sub-optimal. It is therefore desirable to further finetune the agent with extra online interactions before deployment. Unfortunately, offline-to-online RL faces two main challenges: constrained exploratory behavior and state-action distribution shift. In view of this, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solutions to both challenges through the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy that selects informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method that applies conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples, smoothly bridging the offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in the D4RL benchmark. Code is publicly available at https://github.com/guosyjlu/SUNG.
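The two uncertainty-guided components described above can be illustrated with a minimal toy sketch. All names here are hypothetical stand-ins (the real framework uses a learned critic and a VAE density estimator, and its exact ranking and masking rules may differ from the simple top-k heuristic and hard threshold shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in SUNG these come from the learned critic and a
# VAE-based state-action visitation density estimator.
def q_value(actions):
    # Higher = more promising action (toy quadratic bowl around 0.5).
    return -np.sum((actions - 0.5) ** 2, axis=1)

def vae_uncertainty(actions):
    # Stand-in for VAE reconstruction error; high = rarely visited pair.
    return np.sum(np.abs(actions), axis=1)

def optimistic_select(candidate_actions, top_k=4):
    """Optimistic exploration (sketch): among the top-k candidates by value,
    pick the most uncertain one, i.e. high value AND high uncertainty."""
    q = q_value(candidate_actions)
    top = np.argsort(q)[-top_k:]                 # high-value subset
    u = vae_uncertainty(candidate_actions[top])
    return candidate_actions[top[np.argmax(u)]]  # most informative action

def adaptive_loss(td_error, conservative_penalty, u, threshold):
    """Adaptive exploitation (sketch): add the conservative offline-RL term
    only for high-uncertainty samples; low-uncertainty samples use the
    standard online TD objective."""
    mask = (u > threshold).astype(float)
    return td_error + mask * conservative_penalty

candidates = rng.uniform(-1.0, 1.0, size=(16, 2))
a = optimistic_select(candidates)
```

The sketch only conveys the decision logic: exploration trades value against epistemic uncertainty, while exploitation interpolates per-sample between conservative and standard objectives.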