We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in the synchronous, tabular setting. Under a Lipschitz condition, we establish a functional central limit theorem (FCLT) for the averaged iterate $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT yields a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is the regular asymptotically linear (RAL) estimator of the optimal Q-value function $\boldsymbol{Q}^*$ with the most efficient influence function. We present a nonasymptotic analysis of the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.
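To make the object of study concrete, the following is a minimal sketch of synchronous tabular Q-learning with Polyak-Ruppert averaging. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy two-state MDP, the deterministic rewards, the zero initialization, and the specific polynomial step size $\eta_t = t^{-\alpha}$ with $\alpha = 0.7$ are all hypothetical choices made for this example. At each step every state-action pair receives one fresh sampled transition, and the reported estimator is the running average of the iterates. The sketch covers only the averaged iteration, not the inference procedure.

```python
import random

def value_iteration(P, r, gamma, iters=1000):
    """Reference Q* obtained by iterating the Bellman optimality operator."""
    S, A = len(r), len(r[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(A)] for s in range(S)]
    return Q

def averaged_q_learning(P, r, gamma, T, alpha=0.7, seed=0):
    """Synchronous Q-learning; returns the average of the iterates Q_1..Q_T.

    Assumed step size: eta_t = t ** (-alpha), a polynomial schedule.
    """
    rng = random.Random(seed)
    S, A = len(r), len(r[0])
    states = list(range(S))
    Q = [[0.0] * A for _ in range(S)]
    Q_bar = [[0.0] * A for _ in range(S)]
    for t in range(1, T + 1):
        eta = t ** (-alpha)
        # Snapshot of greedy values from the previous iterate, so that
        # all (s, a) entries are updated from the same Q_{t-1}.
        V = [max(row) for row in Q]
        for s in range(S):
            for a in range(A):
                # One independent sampled transition per (s, a) pair.
                s_next = rng.choices(states, weights=P[s][a])[0]
                target = r[s][a] + gamma * V[s_next]
                Q[s][a] = (1.0 - eta) * Q[s][a] + eta * target
        # Running Polyak-Ruppert average of the iterates.
        for s in range(S):
            for a in range(A):
                Q_bar[s][a] += (Q[s][a] - Q_bar[s][a]) / t
    return Q_bar

# Hypothetical toy MDP: P[s][a] is the next-state distribution, r[s][a] the reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
r = [[1.0, 0.0],
     [0.0, 0.5]]
gamma = 0.5

Q_star = value_iteration(P, r, gamma)
Q_bar = averaged_q_learning(P, r, gamma, T=20000)
# Sup-norm error of the averaged iterate, the quantity bounded in the abstract.
err = max(abs(Q_bar[s][a] - Q_star[s][a]) for s in range(2) for a in range(2))
print("sup-norm error of averaged iterate:", err)
```

Averaging is kept as a running mean so that the estimator can be updated online without storing past iterates, which is in the spirit of the fully online setting described above.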