We develop a central limit theorem (CLT) for the non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. Building on this result, we derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. As a corollary, we derive goodness-of-fit tests, which enable us to test whether the logged data is stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.
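To make the object of study concrete, the following is a minimal sketch of a count-based estimate of the transition probabilities from logged data. It assumes that the non-parametric estimator is the empirical frequency of each next state given a state-action pair, which the abstract does not spell out. The function name `estimate_transitions`, the tuple layout of the trajectory, and the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def estimate_transitions(trajectory, n_states):
    """Empirical transition estimate from a logged trajectory (illustrative).

    trajectory: list of (state, action, next_state) tuples with integer states.
    Returns a dict mapping each visited (state, action) pair to a list of
    estimated next-state probabilities of length n_states.
    """
    # counts[(s, a)][s_next] = number of observed transitions s --a--> s_next
    counts = defaultdict(lambda: [0] * n_states)
    for s, a, s_next in trajectory:
        counts[(s, a)][s_next] += 1

    # Normalise each row of counts by the number of visits to (s, a).
    estimates = {}
    for key, row in counts.items():
        visits = sum(row)
        estimates[key] = [c / visits for c in row]
    return estimates


# Hypothetical logged data: (state, action, next_state)
logged = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 0, 0)]
p_hat = estimate_transitions(logged, n_states=2)
```

In this sketch a state-action pair that the logging policy never visits, such as `(0, 1)` in the sample data, receives no estimate at all. That is one intuitive reason why conditions on the logging policy would matter for any asymptotic result about such an estimator.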