The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is largely unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to perform it optimally. In this paper, we present a minimax lower bound for the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process has compelling statistical properties compared with the alternative estimators, as it matches the lower bound without requiring a careful tuning of the episode horizon.
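As a rough illustration of the two sampling procedures contrasted above, the following sketch compares a finite-horizon episodic estimator with an estimator that averages along a single trajectory of a discounted kernel. It is not the paper's implementation. The two-state chain, the reward-like function `f`, the function names, and all numeric settings are hypothetical. The discounted kernel is assumed here to take the common "reset" form, in which the chain follows `P` with probability `gamma` and restarts from the initial distribution `nu` otherwise.

```python
import numpy as np


def exact_discounted_mean(P, nu, f, gamma):
    # Normalized discounted distribution: (1 - gamma) * nu^T (I - gamma P)^{-1}.
    n = P.shape[0]
    mu = (1.0 - gamma) * nu @ np.linalg.inv(np.eye(n) - gamma * P)
    return float(mu @ f)


def finite_horizon_estimate(P, nu, f, gamma, horizon, episodes, rng):
    # Average of truncated, normalized discounted sums over independent episodes.
    n = P.shape[0]
    total = 0.0
    for _ in range(episodes):
        s = rng.choice(n, p=nu)
        ret = 0.0
        for t in range(horizon):
            ret += (gamma ** t) * f[s]
            s = rng.choice(n, p=P[s])
        total += (1.0 - gamma) * ret
    return total / episodes


def discounted_kernel_estimate(P, nu, f, gamma, steps, rng):
    # Assumed kernel: follow P with probability gamma, otherwise reset to nu.
    # The plain sample mean of f along this single trajectory is the estimate.
    n = P.shape[0]
    s = rng.choice(n, p=nu)
    total = 0.0
    for _ in range(steps):
        total += f[s]
        if rng.random() < gamma:
            s = rng.choice(n, p=P[s])
        else:
            s = rng.choice(n, p=nu)
    return total / steps


# Hypothetical two-state Markov reward process used only for illustration.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
nu = np.array([1.0, 0.0])
f = np.array([0.0, 1.0])
gamma = 0.9
rng = np.random.default_rng(0)

exact = exact_discounted_mean(P, nu, f, gamma)
fh = finite_horizon_estimate(P, nu, f, gamma, horizon=50, episodes=400, rng=rng)
dk = discounted_kernel_estimate(P, nu, f, gamma, steps=20000, rng=rng)
print(f"exact={exact:.4f} finite_horizon={fh:.4f} discounted_kernel={dk:.4f}")
```

Both estimators use the same budget of 20,000 transitions in this sketch. The finite-horizon version needs a `horizon` argument, and its truncation bias depends on that choice, whereas the discounted-kernel version has no horizon to tune.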