Estimation and Inference in Distributional Reinforcement Learning

In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete distribution of the random return (denoted $\eta^\pi$) attained by a given policy $\pi$. Assuming a generative model is available, we construct our estimator $\hat\eta^\pi$ using the certainty-equivalence method. We show that in this setting a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein distance between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\epsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\epsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on the Lipschitz function class $\ell^\infty(\mathcal{F}_{W_1})$ and, when some mild conditions hold, also in the spaces of bounded functionals on the indicator function class $\ell^\infty(\mathcal{F}_{\mathrm{KS}})$ and on the bounded measurable function class $\ell^\infty(\mathcal{F}_{\mathrm{TV}})$. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^\pi$.
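
As a rough illustration of the certainty-equivalence idea described above, the following Python sketch draws $n$ next-state samples per state-action pair from a generative model, forms the empirical transition kernel, and compares the return distribution under the empirical model with the one under the true model in the $1$-Wasserstein distance. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-state MDP, the deterministic rewards in $[0,1]$, the deterministic policy, the helper names (`estimate_transitions`, `sample_returns`, `wasserstein_1`), and the use of truncated Monte Carlo rollouts to approximate both return distributions.

```python
import numpy as np

def estimate_transitions(P, n, rng):
    """Query the generative model n times per (state, action) pair
    and return the empirical transition kernel P_hat."""
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            counts = rng.multinomial(n, P[s, a])
            P_hat[s, a] = counts / n
    return P_hat

def sample_returns(P, r, policy, gamma, s0, horizon, num_rollouts, rng):
    """Monte Carlo samples of the discounted return from s0, truncated
    at `horizon` steps (an approximation, not an exact computation)."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]            # (S, S) kernel under the fixed policy
    r_pi = r[idx, policy]            # (S,) deterministic rewards (assumed)
    cdf = np.cumsum(P_pi, axis=1)
    states = np.full(num_rollouts, s0)
    returns = np.zeros(num_rollouts)
    discount = 1.0
    for _ in range(horizon):
        returns += discount * r_pi[states]
        u = rng.random(num_rollouts)
        # Inverse-CDF sampling of the next state for every rollout at once.
        nxt = (u[:, None] > cdf[states]).sum(axis=1)
        states = np.minimum(nxt, S - 1)  # guard against rounding in cumsum
        discount *= gamma
    return returns

def wasserstein_1(x, y):
    """1-Wasserstein distance between two empirical distributions
    with the same number of equally weighted atoms."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

# Hypothetical two-state, two-action MDP used only for illustration.
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
r = np.array([[0.0, 1.0],
              [0.5, 0.2]])                 # r[s, a] in [0, 1]
policy = np.array([1, 0])                  # deterministic policy
gamma = 0.9

P_hat = estimate_transitions(P, n=500, rng=rng)
true_returns = sample_returns(P, r, policy, gamma, 0, 200, 5000, rng)
est_returns = sample_returns(P_hat, r, policy, gamma, 0, 200, 5000, rng)
w1 = wasserstein_1(true_returns, est_returns)
print(f"approximate W1 distance: {w1:.4f}")
```

The printed distance mixes three sources of error: the model error from using `P_hat` in place of `P`, the Monte Carlo error of the rollouts, and the truncation at a finite horizon. Only the first of these corresponds to the quantity bounded in the abstract, so the number should be read as a qualitative illustration rather than a check of the stated rates.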
