Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, policies learned offline are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, whereas if it turns optimistic directly, performance may suffer a sudden drop. We show that Bayesian design principles are crucial to resolving this dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief about which policies are optimal. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach offers a new perspective on offline-to-online RL, with the potential to enable more effective learning from offline data.
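To make the probability-matching principle concrete, below is a minimal sketch in its simplest setting: Thompson sampling on a Bernoulli bandit. The bandit setup and all names here are illustrative assumptions, not the algorithm proposed in the paper. Rather than acting greedily on point estimates (optimism) or conservatively (pessimism), the agent samples from its posterior and plays the arm that is optimal under that sample, so each arm is selected with exactly the probability the agent currently believes it is the best one.

```python
# A minimal sketch of probability matching (Thompson sampling) on a
# Bernoulli bandit; the setup is an illustrative assumption, not the
# paper's method. In an offline-to-online setting, the Beta counts
# below could be warm-started from an offline dataset.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the agent
n_arms = len(true_means)

# Beta(1, 1) prior per arm.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(2000):
    theta = rng.beta(alpha, beta)        # one posterior sample per arm
    arm = int(np.argmax(theta))          # arm that is optimal under the sample
    reward = float(rng.random() < true_means[arm])
    alpha[arm] += reward                 # conjugate Bayesian update
    beta[arm] += 1.0 - reward

print("posterior means:", alpha / (alpha + beta))  # concentrates on arm 2
```

Because actions are drawn in proportion to the posterior probability of being optimal, the sketch neither collapses to the offline point estimate (pessimism) nor overrides the posterior with exploration bonuses (optimism), mirroring the dilemma the abstract describes.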