Optimizing survival outcomes, such as patient survival or customer retention, is a critical objective in data-driven decision-making. Off-Policy Evaluation (OPE) provides a powerful framework for assessing decision-making policies using logged data alone, without the costly or risky online experiments that high-stakes applications would otherwise require. However, standard OPE estimators are not designed to handle right-censored survival outcomes: they ignore the unobserved survival time beyond the censoring time and therefore systematically underestimate the true policy performance. To address this issue, we propose a novel framework for OPE and Off-Policy Learning (OPL) tailored to survival outcomes under censoring. Specifically, we introduce IPCW-IPS and IPCW-DR, which employ Inverse Probability of Censoring Weighting (IPCW) to correct explicitly for censoring bias. We theoretically establish that our estimators are unbiased and that IPCW-DR is doubly robust, remaining consistent if either the propensity score or the outcome model is correctly specified. Furthermore, we extend the framework to constrained OPL, which optimizes policy value under budget constraints. We demonstrate the effectiveness of the proposed methods through simulation studies and illustrate their practical impact on public real-world data for both evaluation and learning tasks.
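To make the idea of combining importance weighting with censoring weights concrete, the following is a minimal sketch of one generic, textbook-style form of an IPCW-weighted IPS estimate. It is not taken from the paper and may differ from the authors' IPCW-IPS. The function names `ipcw_ips` and `naive_ips` are hypothetical, and the sketch assumes that the logging propensities and the censoring survival probabilities G(t) = P(C > t) at each observed time are known or have been estimated separately.

```python
import numpy as np


def naive_ips(observed_time, target_prob, logging_prob):
    """Plain IPS on observed times; treats censored times as if they were true survival times."""
    iw = np.asarray(target_prob, dtype=float) / np.asarray(logging_prob, dtype=float)
    return float(np.mean(iw * np.asarray(observed_time, dtype=float)))


def ipcw_ips(observed_time, event, target_prob, logging_prob, censor_surv):
    """Generic IPCW-weighted IPS sketch (an assumption, not the paper's exact estimator).

    observed_time : min(survival time, censoring time) for each logged sample
    event         : 1 if the survival time was observed, 0 if the sample was censored
    target_prob   : probability of the logged action under the evaluation policy
    logging_prob  : probability of the logged action under the logging policy
    censor_surv   : assumed-known P(C > observed_time) for each sample
    """
    observed_time = np.asarray(observed_time, dtype=float)
    event = np.asarray(event, dtype=float)
    iw = np.asarray(target_prob, dtype=float) / np.asarray(logging_prob, dtype=float)
    # Uncensored samples are up-weighted by 1 / G(t); censored samples get weight 0.
    ipcw = event / np.maximum(np.asarray(censor_surv, dtype=float), 1e-12)
    return float(np.mean(iw * ipcw * observed_time))
```

When no sample is censored (all `event` values are 1 and all `censor_surv` values are 1), the sketch reduces to plain IPS. With censoring, the uncensored samples are up-weighted to stand in for the censored ones, which is the mechanism the abstract refers to as dealing with censoring bias.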