A promising application of offline reinforcement learning (RL) is to pre-train a policy on existing static datasets and then fine-tune it online. However, directly fine-tuning the offline pre-trained policy often yields sub-optimal performance. A primary reason is that conservative offline methods diminish the agent's capacity for exploration, which in turn degrades online fine-tuning performance. To strengthen exploration during online fine-tuning and thereby improve overall performance, we introduce a generalized reward augmentation framework called Sample Efficient Reward Augmentation (SERA). SERA designs intrinsic rewards that encourage the agent to explore. Specifically, it implicitly implements State Marginal Matching (SMM) and penalizes out-of-distribution (OOD) state-action pairs, thus encouraging the agent to cover the target state density and achieve better online fine-tuning results. Additionally, SERA can be readily plugged into various RL algorithms to improve online fine-tuning and ensure sustained asymptotic improvement, which demonstrates both its versatility and its effectiveness. Extensive experimental results demonstrate that, on offline-to-online problems, SERA consistently and effectively enhances the performance of various offline algorithms.
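To make the idea of reward augmentation concrete, the following is a minimal sketch of adding an exploration bonus to environment rewards during online fine-tuning. It is not SERA's actual reward: the abstract does not give the formula, so this sketch assumes a common stand-in, a k-nearest-neighbor novelty bonus computed over a batch of states. The function names `knn_intrinsic_reward` and `augment_rewards` and the parameters `k` and `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np


def knn_intrinsic_reward(states, k=5):
    """Assumed novelty bonus: log(1 + distance to the k-th nearest neighbor).

    States far from the rest of the batch get a larger bonus. This is an
    illustrative stand-in, not the intrinsic reward defined by SERA.
    """
    states = np.asarray(states, dtype=np.float64)
    # Pairwise Euclidean distances between all states, shape (n, n).
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # After sorting each row, column 0 is the zero distance of a state to itself,
    # so column k holds the distance to the k-th nearest other state.
    k = min(k, len(states) - 1)
    kth = np.sort(dists, axis=1)[:, k]
    return np.log1p(kth)


def augment_rewards(rewards, states, beta=0.1, k=5):
    """Return task rewards plus a beta-weighted intrinsic bonus (hypothetical weighting)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return rewards + beta * knn_intrinsic_reward(states, k=k)


if __name__ == "__main__":
    # Four tightly clustered states and one isolated state.
    batch_states = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
    batch_rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
    print(augment_rewards(batch_rewards, batch_states, beta=0.5, k=2))
```

Because the bonus is only added to the reward signal, a sketch like this can sit in front of any RL update rule that consumes `(state, action, reward)` tuples, which is the sense in which a reward augmentation scheme is algorithm-agnostic. The weight `beta` trades off task reward against exploration and would need tuning per task.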