Distributional reinforcement learning improves performance by effectively capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in the finite episodic Markov decision process setting. We first introduce the key notion of Bellman unbiasedness, which enables tractable and exactly learnable updates via statistical functional dynamic programming. Our theoretical results show that, among statistical functionals (including nonlinear ones), approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only way to learn statistical information without bias. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, which achieves a regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
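To make the claim about moment functionals concrete, the following is a minimal sketch (standard distributional Bellman identities, not taken from the abstract itself) of why the first two moments of the return admit exact finite-dimensional updates, assuming the reward $R_h$ and the next-state return are conditionally independent given $(s, a, S')$, with all outer expectations taken over $(R_h, S')$ given $(s, a)$:

% Sketch: exact recursions for the first two return moments in an episodic MDP.
% A nonlinear functional such as the median admits no analogous finite-dimensional recursion.
\begin{align*}
  Z_h(s,a) &\overset{d}{=} R_h + Z_{h+1}(S'), \qquad S' \sim P_h(\cdot \mid s,a),\\
  \mathbb{E}\bigl[Z_h(s,a)\bigr]
    &= \mathbb{E}\bigl[R_h\bigr] + \mathbb{E}\Bigl[\mathbb{E}\bigl[Z_{h+1}(S') \mid S'\bigr]\Bigr],\\
  \mathbb{E}\bigl[Z_h(s,a)^2\bigr]
    &= \mathbb{E}\bigl[R_h^2\bigr]
     + 2\,\mathbb{E}\Bigl[R_h\,\mathbb{E}\bigl[Z_{h+1}(S') \mid S'\bigr]\Bigr]
     + \mathbb{E}\Bigl[\mathbb{E}\bigl[Z_{h+1}(S')^2 \mid S'\bigr]\Bigr].
\end{align*}

Each right-hand side depends on $Z_{h+1}$ only through its first two moments, so a finite moment vector can be propagated exactly by dynamic programming; by contrast, the median of $R_h + Z_{h+1}(S')$ is in general not determined by any finite collection of statistics of $Z_{h+1}$, which is the kind of obstruction that Bellman unbiasedness is meant to formalize.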