We propose a new algorithm for model-based distributional reinforcement learning (RL), and prove that it is minimax-optimal for approximating return distributions with a generative model (up to logarithmic factors), resolving an open question of Zhang et al. (2023). Our analysis provides new theoretical results on categorical approaches to distributional RL, and also introduces a new distributional Bellman equation, the stochastic categorical CDF Bellman equation, which we expect to be of independent interest. We also provide an experimental study comparing several model-based distributional RL algorithms, yielding a number of takeaways for practitioners.
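As background on the categorical approach referenced above, a minimal sketch of the standard categorical projection step from the distributional RL literature is given below: the distributional Bellman target shifts and scales the return atoms, and the resulting distribution is projected back onto a fixed support grid. This is the generic categorical (C51-style) projection, not the algorithm proposed in this paper; all function and variable names here are illustrative.

```python
import numpy as np

def categorical_projection(z, p, r, gamma, v_min, v_max):
    """Project the shifted distribution (r + gamma * z, p) back onto
    the fixed support z -- the standard categorical projection step.
    z: support atoms (evenly spaced in [v_min, v_max]); p: their masses."""
    m = len(z)
    delta = (v_max - v_min) / (m - 1)
    # Distributional Bellman shift, clipped to the support range.
    tz = np.clip(r + gamma * z, v_min, v_max)
    # Fractional index of each shifted atom on the support grid.
    b = (tz - v_min) / delta
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    proj = np.zeros(m)
    # Split each atom's mass between its two neighbouring grid points.
    np.add.at(proj, lo, p * (hi - b))
    np.add.at(proj, hi, p * (b - lo))
    # Atoms landing exactly on a grid point (lo == hi) get zero mass
    # from both terms above; restore their full mass here.
    exact = (lo == hi)
    np.add.at(proj, lo[exact], p[exact])
    return proj
```

The key property of this projection is that it keeps the return-distribution representation closed under the Bellman update: the output is again a probability vector on the same fixed grid, which is what makes fixed-point analyses of categorical distributional operators possible.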