Offline reinforcement learning (RL) is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, many distinct MDPs can behave identically on the offline dataset, and handling the resulting uncertainty about the true MDP is challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search (MCTS) and can be integrated into offline MBRL as a policy improvement operator within policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating additional computation at decision time. The proposed algorithm significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three challenging, stochastic tokamak control tasks. The codebase is available at: https://github.com/LucasCJYSDL/Offline-RL-Kit.
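To make the "MCTS as a policy improvement operator" idea concrete, here is a minimal, self-contained sketch of UCT-style tree search over a learned dynamics model. This is an illustrative toy, not the paper's algorithm: the names (`mcts_improve`, `ucb_pick`) are hypothetical, the model here is a single deterministic function with discrete actions, and it omits the Bayes-adaptive belief over models, continuous action progressive widening, and the actor/critic priors that the full method would use.

```python
import math

class Node:
    """A search-tree node; stores per-action visit counts and total returns."""
    def __init__(self):
        self.N = {}         # action -> visit count
        self.W = {}         # action -> summed return
        self.children = {}  # action -> child Node

def ucb_pick(node, actions, c=1.4):
    """UCB1 action selection: mean return plus an exploration bonus."""
    total = sum(node.N.get(a, 0) for a in actions) + 1
    def score(a):
        n = node.N.get(a, 0)
        q = node.W.get(a, 0.0) / n if n else 0.0
        return q + c * math.sqrt(math.log(total) / (n + 1))
    return max(actions, key=score)

def mcts_improve(state, model, reward, actions, n_sims=200, depth=5, gamma=0.99):
    """Policy improvement via search: simulate rollouts in the learned model,
    then act greedily w.r.t. root visit counts (AlphaZero-style).
    Simplification: each simulation backs up the full discounted return
    accumulated from the root, rather than per-node returns."""
    root = Node()
    for _ in range(n_sims):
        node, s, path, ret, disc = root, state, [], 0.0, 1.0
        for _ in range(depth):
            a = ucb_pick(node, actions)
            ret += disc * reward(s, a)
            disc *= gamma
            s = model(s, a)            # one step in the learned model
            path.append((node, a))
            node = node.children.setdefault(a, Node())
        for node, a in path:           # backprop visit counts and returns
            node.N[a] = node.N.get(a, 0) + 1
            node.W[a] = node.W.get(a, 0.0) + ret
    return max(actions, key=lambda a: root.N.get(a, 0))

# Toy problem: drive a 1D state toward 0 with actions {-1, +1}.
best = mcts_improve(state=3,
                    model=lambda s, a: s + a,
                    reward=lambda s, a: -abs(s + a),
                    actions=(-1, 1))
```

From state 3, the search concentrates its visits on the action that moves toward the origin, illustrating how search output can serve as an improved policy target for the underlying actor.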