Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.
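To make the selection rule concrete, here is a minimal sketch of Bellman-error-based OMS for tabular Q-functions. It is not the estimator proposed in this work: it uses the naive empirical squared TD error, which is known to be a biased estimate of the MSBE when transitions are stochastic (the double-sampling problem). The function names, the `(s, a, r, s_next, done)` transition layout, and the default discount are all assumptions made for illustration.

```python
def empirical_bellman_error(q, transitions, gamma=0.99):
    """Mean squared TD error of a tabular Q-function on logged transitions.

    q           : list of lists, q[state][action] -> value (hypothetical layout)
    transitions : list of (s, a, r, s_next, done) tuples from the logged data
    """
    total = 0.0
    for s, a, r, s_next, done in transitions:
        # Bellman optimality backup; no bootstrapping from terminal states.
        bootstrap = 0.0 if done else gamma * max(q[s_next])
        total += (q[s][a] - (r + bootstrap)) ** 2
    return total / len(transitions)


def select_by_bellman_error(candidates, transitions, gamma=0.99):
    """Return the index of the candidate Q-function with the smallest
    empirical Bellman error, together with all candidates' errors."""
    errors = [empirical_bellman_error(q, transitions, gamma) for q in candidates]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors


# Toy logged dataset: from state 0, action 0 yields reward 1, action 1 yields 0.
logged = [(0, 0, 1.0, 1, True), (0, 1, 0.0, 1, True)]
q_good = [[1.0, 0.0], [0.0, 0.0]]  # consistent with the logged rewards
q_bad = [[0.0, 1.0], [0.0, 0.0]]   # swaps the two action values
best, errors = select_by_bellman_error([q_good, q_bad], logged, gamma=0.9)
```

The policy associated with the selected Q-function would then be its greedy policy. Whether the lowest-error candidate is actually the best policy depends on the conditions analyzed in the paper.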