Distributional reinforcement learning (DRL) has achieved empirical success in various domains. A core task in DRL is distributional policy evaluation: estimating the return distribution $\eta^\pi$ of a given policy $\pi$. To this end, the distributional temporal difference (TD) algorithm was proposed, extending the temporal difference algorithm from the classic RL literature. In the tabular setting, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely the categorical temporal difference (CTD) and quantile temporal difference (QTD) algorithms, respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose a non-parametric distributional TD algorithm (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD requires $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.
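For reference, the $p$-Wasserstein distance used above to measure the estimation error admits, for distributions on $\mathbb{R}$ (as return distributions are), the standard quantile-function representation; we state it here for completeness, with $F_\mu^{-1}$ and $F_\nu^{-1}$ denoting the quantile functions of $\mu$ and $\nu$:
\[
W_p(\mu, \nu) \;=\; \left( \int_0^1 \left| F_\mu^{-1}(u) - F_\nu^{-1}(u) \right|^p \, du \right)^{1/p}, \qquad p \in [1, \infty).
\]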