In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete distribution of the random return (denoted $\eta^\pi$) attained by a given policy $\pi$. Assuming access to a generative model, we construct our estimator $\hat\eta^\pi$ by the certainty-equivalence method. We show that in this setting a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein distance between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\epsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov distance and the total variation distance between $\hat\eta^\pi$ and $\eta^\pi$ are each below $\epsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in the space $\ell^\infty(\mathcal{F}_{W_1})$ of bounded functionals on the Lipschitz function class and, when some mild conditions hold, also in the spaces $\ell^\infty(\mathcal{F}_{\mathrm{KS}})$ and $\ell^\infty(\mathcal{F}_{\mathrm{TV}})$ of bounded functionals on the indicator function class and the bounded measurable function class, respectively. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^\pi$.
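To make the certainty-equivalence idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a small tabular MDP with deterministic rewards and a deterministic policy, draws $n$ next-state samples per state-action pair from a generative model to form an empirical kernel, and then compares return samples under the empirical and true kernels with the 1-Wasserstein distance. The function names (`sample_empirical_model`, `sample_returns`, `wasserstein1`), the toy MDP, and the use of truncated Monte Carlo rollouts to approximate the return distributions are all illustrative assumptions that do not appear in the abstract.

```python
import numpy as np

def sample_empirical_model(P, n, rng):
    """Query the generative model n times per (s, a) and return the
    empirical transition kernel P_hat, with the same shape (S, A, S) as P."""
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            counts = rng.multinomial(n, P[s, a])  # n i.i.d. next-state draws
            P_hat[s, a] = counts / n
    return P_hat

def sample_returns(P, r, pi, gamma, s0, horizon, n_rollouts, rng):
    """Monte Carlo samples of the discounted return from s0, truncated at
    `horizon` steps. `pi` is an array of actions (deterministic policy,
    an assumption of this sketch) and r[s, a] is a deterministic reward."""
    S = P.shape[0]
    states = np.full(n_rollouts, s0)
    returns = np.zeros(n_rollouts)
    discount = 1.0
    for _ in range(horizon):
        actions = pi[states]
        returns += discount * r[states, actions]
        # Inverse-CDF sampling of the next state for every rollout at once.
        cdf = np.cumsum(P[states, actions], axis=1)
        u = rng.random(n_rollouts)
        states = np.minimum((u[:, None] > cdf).sum(axis=1), S - 1)
        discount *= gamma
    return returns

def wasserstein1(x, y):
    """1-Wasserstein distance between two equal-size empirical samples."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

# Toy problem (hypothetical): 3 states, 2 actions, rewards in [0, 1].
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # true kernel, shape (S, A, S)
r = rng.random((S, A))
pi = np.array([0, 1, 0])

true_returns = sample_returns(P, r, pi, gamma, 0, 100, 2000, rng)
for n in (10, 1000):
    P_hat = sample_empirical_model(P, n, rng)
    est_returns = sample_returns(P_hat, r, pi, gamma, 0, 100, 2000, rng)
    print(n, wasserstein1(est_returns, true_returns))
```

The printed distances mix two error sources: the model error from using the empirical kernel, which is what the sample-complexity bounds control, and the Monte Carlo and truncation error of the rollouts, which this sketch introduces only for simplicity.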