We propose a new algorithm for model-based distributional reinforcement learning (RL), and prove that it is minimax-optimal (up to logarithmic factors) for approximating return distributions with a generative model, resolving an open question of Zhang et al. (2023). Our analysis provides new theoretical results on categorical approaches to distributional RL, and also introduces a new distributional Bellman equation, the stochastic categorical CDF Bellman equation, which we expect to be of independent interest. We also provide an experimental study comparing several model-based distributional RL algorithms, and highlight takeaways for practitioners.