We study high-confidence off-policy evaluation in infinite-horizon Markov decision processes, where the objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies. This task faces two primary challenges: providing a comprehensive and rigorous error quantification in CI estimation, and addressing the distributional shift that results from discrepancies between the distribution induced by the target policy and the offline data-generating process. Motivated by a unified error analysis, we jointly quantify the two sources of estimation error within a single interval: the misspecification error in modeling marginalized importance weights and the statistical uncertainty due to sampling. This unified framework reveals a previously hidden tradeoff between the two errors, which undermines the tightness of the CI. Relying on a carefully designed discriminator function, the proposed estimator achieves a dual purpose: breaking the curse of the tradeoff to attain the tightest possible CI, and adapting the CI to ensure robustness against distributional shifts. By leveraging a local supermartingale/martingale structure, our method is applicable to time-dependent data without assuming any weak dependence conditions. Theoretically, we show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings. The numerical performance of the proposed method is examined on synthetic datasets and an OhioT1DM mobile health study.