We study high-confidence off-policy evaluation in infinite-horizon Markov decision processes, where the objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies. This task faces two primary challenges: providing a comprehensive and rigorous error quantification in CI estimation, and addressing the distributional shift that results from discrepancies between the distribution induced by the target policy and the offline data-generating process. Motivated by a unified error analysis, we jointly quantify two sources of estimation error within a single interval: the misspecification error in modeling marginalized importance weights and the statistical uncertainty due to sampling. This unified framework reveals a previously hidden tradeoff between the two errors, which undermines the tightness of the CI. Relying on a carefully designed discriminator function, the proposed estimator achieves a dual purpose: breaking the curse of the tradeoff to attain the tightest possible CI, and adapting the CI to ensure robustness against distributional shifts. By leveraging a local supermartingale/martingale structure, our method is applicable to time-dependent data without assuming any weak dependence conditions. Theoretically, we show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings. The numerical performance of the proposed method is examined on synthetic datasets and in the OhioT1DM mobile health study.
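To make the two error sources concrete, the following is a minimal sketch of a naive baseline, not the estimator proposed here. It assumes the marginalized importance weights are already given and correctly specified, that the transitions are i.i.d., and that each weighted reward is bounded in absolute value by a known constant. Under those assumptions the (normalized) policy value is the mean of the weighted rewards, and a Hoeffding bound gives an interval that covers sampling uncertainty only. The function name `naive_mis_interval` and its parameters are hypothetical, introduced purely for illustration.

```python
import math


def naive_mis_interval(rewards, weights, bound, delta=0.05):
    """Naive weighted-reward estimate with a Hoeffding interval.

    Assumptions (illustrative only, not the paper's method):
      * `weights` are marginalized importance weights, taken as given;
      * samples are i.i.d.;
      * every product weight * reward lies in [-bound, bound].
    Returns (lower, estimate, upper) at confidence level 1 - delta.
    """
    n = len(rewards)
    if n == 0 or n != len(weights):
        raise ValueError("rewards and weights must be non-empty and equal length")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")

    terms = [w * r for w, r in zip(weights, rewards)]
    if any(abs(t) > bound for t in terms):
        raise ValueError("a weighted reward exceeds the stated bound")

    estimate = sum(terms) / n
    # Hoeffding for variables with range 2 * bound:
    # P(|mean - mu| >= t) <= 2 * exp(-n * t**2 / (2 * bound**2)).
    half_width = bound * math.sqrt(2.0 * math.log(2.0 / delta) / n)
    return estimate - half_width, estimate, estimate + half_width


if __name__ == "__main__":
    # Toy data, made up for illustration.
    lower, estimate, upper = naive_mis_interval(
        rewards=[1.0, 0.0, 1.0, 0.0],
        weights=[2.0, 0.5, 0.0, 1.0],
        bound=2.0,
    )
    print(lower, estimate, upper)
```

This baseline has no term for weight misspecification and relies on independence, which are the two gaps the abstract describes: the proposed interval accounts for misspecification error jointly with sampling uncertainty and handles time-dependent data.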