Offline reinforcement learning (RL) offers a promising way to train agents in a fully data-driven paradigm. However, constrained by the limited quality of the offline dataset, the resulting performance is often sub-optimal. It is therefore desirable to further finetune the agent with extra online interactions before deployment. Unfortunately, offline-to-online RL faces two main challenges: constrained exploratory behavior and state-action distribution shift. In view of this, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solutions to both challenges through the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy that selects informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method that applies conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples, smoothly bridging the offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in the D4RL benchmark. Code is publicly available at https://github.com/guosyjlu/SUNG.
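The two uncertainty-guided components described above can be illustrated with a minimal toy sketch. All names here are hypothetical stand-ins (the real framework uses a learned critic and a VAE density estimator, and its exact ranking and masking rules may differ from the simple top-k heuristic and hard threshold shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in SUNG these come from the learned critic and a
# VAE-based state-action visitation density estimator.
def q_value(actions):
    # Higher = more promising action (toy quadratic bowl around 0.5).
    return -np.sum((actions - 0.5) ** 2, axis=1)

def vae_uncertainty(actions):
    # Stand-in for VAE reconstruction error; high = rarely visited pair.
    return np.sum(np.abs(actions), axis=1)

def optimistic_select(candidate_actions, top_k=4):
    """Optimistic exploration (sketch): among the top-k candidates by value,
    pick the most uncertain one, i.e. high value AND high uncertainty."""
    q = q_value(candidate_actions)
    top = np.argsort(q)[-top_k:]                 # high-value subset
    u = vae_uncertainty(candidate_actions[top])
    return candidate_actions[top[np.argmax(u)]]  # most informative action

def adaptive_loss(td_error, conservative_penalty, u, threshold):
    """Adaptive exploitation (sketch): add the conservative offline-RL term
    only for high-uncertainty samples; low-uncertainty samples use the
    standard online TD objective."""
    mask = (u > threshold).astype(float)
    return td_error + mask * conservative_penalty

candidates = rng.uniform(-1.0, 1.0, size=(16, 2))
a = optimistic_select(candidates)
```

The sketch only conveys the decision logic: exploration trades value against epistemic uncertainty, while exploitation interpolates per-sample between conservative and standard objectives.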