In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We illustrate our results with several examples, including ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish corresponding bounds for offline evaluation of stationary Markov control policies.
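The abstract does not spell out the estimator, so the following is only a minimal sketch under the assumption that the "natural nonparametric estimator" is the usual empirical-frequency (count-based) estimate of P(s' | s, a) computed from a single logged trajectory. The function name `estimate_transitions`, its argument layout, and the convention of leaving unvisited state-action pairs at zero are illustrative choices, not taken from the paper.

```python
import numpy as np

def estimate_transitions(states, actions, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) from one logged trajectory.

    Assumed layout (hypothetical): states = [s_0, ..., s_T] and
    actions = [a_0, ..., a_{T-1}], both holding integer indices.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    # Tally every observed transition (s_t, a_t) -> s_{t+1}.
    for s, a, s_next in zip(states[:-1], actions, states[1:]):
        counts[s, a, s_next] += 1
    # Number of times each (s, a) pair was visited by the logging policy.
    visits = counts.sum(axis=2, keepdims=True)
    # Normalise counts into probabilities; rows for unvisited (s, a)
    # pairs stay at zero (an assumed convention, not from the paper).
    p_hat = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
    return p_hat, visits.squeeze(axis=2)

# Toy trajectory with two states and two actions.
states = [0, 1, 0, 1, 1]
actions = [0, 0, 1, 0]
p_hat, visits = estimate_transitions(states, actions, n_states=2, n_actions=2)
```

In this sketch the visit counts are returned alongside the estimate because how often the logging policy reaches each state-action pair governs how reliable the corresponding row of the estimate is, which is where the mixing properties mentioned above come into play.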