Off-policy evaluation (OPE) methods estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of OPE methods. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation within OPE lack principled uncertainty quantification. In high-stakes domains like healthcare, reliable uncertainty estimates are important for ensuring safe and informed deployment of RL policies. In this work, we propose two methods to construct valid confidence intervals for OPE with data augmentation. The first provides a confidence interval over $V^π(s)$, the policy value conditioned on an initial state $s$. To do so, we introduce a new conformal prediction method suitable for Markov Decision Processes (MDPs) with continuous state spaces, extending prior work to higher-dimensional settings. The second addresses the more common task of estimating the average policy performance over all initial states, $V^π$; here we introduce a method that draws on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning inventory management, robotics, and healthcare, as well as a real healthcare dataset from MIMIC-IV, we find that our methods can effectively leverage auxiliary data and consistently produce confidence intervals that cover the ground-truth policy values, unlike previously proposed methods. Our work enables a future in which OPE can provide rigorous uncertainty estimates for high-stakes domains.
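To make the prediction-powered idea concrete, the following minimal sketch builds a confidence interval for a plain mean from a small trusted sample and a large, possibly biased, auxiliary sample. It is the generic prediction-powered mean estimator under a normal approximation, not the method proposed in this work, which additionally draws on doubly robust estimation. The function name `ppi_mean_ci` and its arguments are hypothetical, and the two samples are assumed to be independent and i.i.d.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(y_real, pred_real, pred_aux, alpha=0.05):
    """Prediction-powered confidence interval for a mean (illustrative sketch).

    y_real    : observed outcomes on the small trusted sample (e.g. returns)
    pred_real : model predictions for the same trusted sample
    pred_aux  : model predictions on the large auxiliary sample
    alpha     : miscoverage level; the interval targets 1 - alpha coverage
    """
    if len(y_real) != len(pred_real):
        raise ValueError("y_real and pred_real must have the same length")
    if len(y_real) < 2 or len(pred_aux) < 2:
        raise ValueError("each sample needs at least two points")

    n, n_aux = len(y_real), len(pred_aux)

    # Residuals measure how far the predictions are from real outcomes.
    residuals = [y - f for y, f in zip(y_real, pred_real)]

    # Point estimate: auxiliary mean, corrected by the mean residual
    # estimated on the trusted sample.
    estimate = fmean(pred_aux) + fmean(residuals)

    # Standard error: the two sample means are assumed independent,
    # so their variances add.
    std_err = math.sqrt(variance(pred_aux) / n_aux + variance(residuals) / n)

    # Two-sided normal quantile for the requested coverage level.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)
```

In an OPE setting, one might take `y_real` to be return estimates from logged trajectories and `pred_real`, `pred_aux` to be values predicted by an auxiliary model; that mapping is an assumption for illustration only. The bias-correction term is what keeps the interval centered correctly when the auxiliary predictions are systematically off, at the cost of a wider interval when the residuals are noisy.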