In this paper, we propose $\texttt{Count-MORL}$, a model-based offline reinforcement learning method that integrates count-based conservatism. Our method uses count estimates of state-action pairs to quantify model estimation error and is, to the best of our knowledge, the first algorithm to demonstrate the efficacy of count-based conservatism in model-based offline deep RL. We first show that the estimation error is inversely proportional to the frequency of state-action pairs. Second, we demonstrate that the policy learned under the count-based conservative model offers near-optimal performance guarantees. Through extensive numerical experiments, we validate that $\texttt{Count-MORL}$ with a hash code implementation significantly outperforms existing offline RL algorithms on the D4RL benchmark datasets. The code is available at $\href{https://github.com/oh-lab/Count-MORL}{https://github.com/oh-lab/Count-MORL}$.
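To make the idea of count-based conservatism with hash codes concrete, the following is a minimal illustrative sketch and not the implementation from the linked repository. It assumes a sign-of-random-projection hash over concatenated state-action vectors and a reward penalty of the form $\beta / \sqrt{n + 1}$; the names `HashCounter`, `penalized_reward`, `n_bits`, and `beta`, as well as the exact penalty form, are hypothetical choices made for illustration.

```python
import numpy as np


class HashCounter:
    """Approximate state-action visit counts using binary hash codes.

    Assumption: codes are the sign bits of a fixed random projection,
    so nearby state-action pairs tend to share a code.
    """

    def __init__(self, input_dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection matrix: (state_dim + action_dim) x n_bits.
        self.projection = rng.standard_normal((input_dim, n_bits))
        self.counts = {}

    def _codes(self, states, actions):
        # Concatenate each state with its action, project, and keep sign bits.
        x = np.concatenate([states, actions], axis=-1)
        bits = (x @ self.projection) > 0
        return [row.tobytes() for row in bits]

    def update(self, states, actions):
        # Increment the count of every hash code seen in the offline dataset.
        for code in self._codes(states, actions):
            self.counts[code] = self.counts.get(code, 0) + 1

    def count(self, states, actions):
        # Look up counts; unseen codes have count zero.
        codes = self._codes(states, actions)
        return np.array([self.counts.get(c, 0) for c in codes], dtype=float)


def penalized_reward(rewards, counts, beta=1.0):
    # Hypothetical penalty: larger for rarely visited pairs.
    # The +1 avoids division by zero for pairs never seen in the dataset.
    return rewards - beta / np.sqrt(counts + 1.0)
```

In this sketch, the counter would be fitted once on the offline dataset, and `penalized_reward` would then be applied to rewards produced during model rollouts, so that transitions in poorly covered regions are valued more pessimistically.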