We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. By verifying Hadamard differentiability of the Bellman operators, we extend bootstrap distributional consistency via the delta method to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR), yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs and often attain coverage close to the nominal levels ($50\%$, $90\%$, $95\%$), whereas the baselines are poorly calibrated at small sample sizes and short episode lengths.
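To make the pipeline concrete, the following minimal sketch estimates a transition kernel from observed $(s, a, s')$ triples, resamples it, and forms percentile CIs for the value of a target policy. It is an illustration under stated assumptions, not the resampling scheme analyzed in the paper: here the bootstrap keeps the observed visitation counts fixed and redraws next-state counts from a multinomial under the estimated kernel. The function names (`estimate_kernel`, `policy_value`, `bootstrap_value_ci`), the known reward table, and the stationary target policy are all hypothetical choices made for the example.

```python
import numpy as np


def estimate_kernel(transitions, n_states, n_actions):
    """Empirical transition kernel and visitation counts from (s, a, s') triples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2)
    # Assumption: unvisited (s, a) pairs fall back to a uniform row.
    p_hat = np.full_like(counts, 1.0 / n_states)
    seen = visits > 0
    p_hat[seen] = counts[seen] / visits[seen][:, None]
    return p_hat, visits.astype(int)


def policy_value(p, reward, policy, gamma):
    """Solve V = r_pi + gamma * P_pi V for a stationary policy pi(a | s)."""
    p_pi = np.einsum("sa,sat->st", policy, p)     # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", policy, reward)  # expected one-step reward under pi
    return np.linalg.solve(np.eye(p.shape[0]) - gamma * p_pi, r_pi)


def bootstrap_value_ci(p_hat, visits, reward, policy, gamma,
                       n_boot=500, level=0.90, seed=0):
    """Percentile CIs for V^pi from a count-conditional model-based bootstrap.

    Assumption: each bootstrap kernel row is a multinomial redraw with the
    observed number of visits, which is a simplification for illustration.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = p_hat.shape
    values = np.empty((n_boot, n_states))
    for b in range(n_boot):
        p_star = p_hat.copy()
        for s in range(n_states):
            for a in range(n_actions):
                n = visits[s, a]
                if n > 0:
                    p_star[s, a] = rng.multinomial(n, p_hat[s, a]) / n
        values[b] = policy_value(p_star, reward, policy, gamma)
    alpha = 1.0 - level
    lower, upper = np.quantile(values, [alpha / 2, 1 - alpha / 2], axis=0)
    return lower, upper
```

Calling `bootstrap_value_ci(p_hat, visits, reward, policy, gamma=0.9)` returns one lower and one upper bound per state. Conditioning on the visitation counts sidesteps the unknown behavior policy in this sketch, at the cost of ignoring the randomness in the counts themselves.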