We develop a central limit theorem (CLT) for a non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build on this result to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. Goodness-of-fit tests follow as a corollary, making it possible to test whether the logged data are stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.
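As an illustration of the kind of non-parametric estimator discussed above, the following minimal sketch computes count-based transition estimates from logged data. The abstract does not specify the estimator's form, so the empirical-frequency estimator, the function name `estimate_transitions`, and the `(state, action, next_state)` triple format are all assumptions made for illustration only.

```python
from collections import Counter


def estimate_transitions(logged):
    """Empirical-frequency estimate of P(s' | s, a).

    `logged` is an iterable of (state, action, next_state) triples.
    This is an assumed, illustrative estimator, not necessarily the
    one analysed in the paper.
    """
    pair_counts = Counter()    # N(s, a): visits to each state-action pair
    triple_counts = Counter()  # N(s, a, s'): observed transitions
    for s, a, s_next in logged:
        pair_counts[(s, a)] += 1
        triple_counts[(s, a, s_next)] += 1
    # Only visited (s, a) pairs receive an estimate; unvisited pairs are
    # left out rather than assigned an arbitrary distribution.
    p_hat = {
        (s, a, s_next): count / pair_counts[(s, a)]
        for (s, a, s_next), count in triple_counts.items()
    }
    return p_hat, pair_counts


# Toy logged data (hypothetical): two states, two actions.
logged = [(0, 0, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0)]
p_hat, visits = estimate_transitions(logged)
```

In this sketch, a state-action pair that the logging policy never visits has no estimate at all, which gives some intuition for why conditions on the logging policy matter for asymptotic normality.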