Prediction under interventions: evaluation of counterfactual performance using longitudinal observational data

Predictions under interventions are estimates of what a person's risk of an outcome would be if they were to follow a particular treatment strategy, given their individual characteristics. Such predictions can give important input to medical decision making. However, evaluating predictive performance of interventional predictions is challenging. Standard ways of evaluating predictive performance do not apply when using observational data, because prediction under interventions involves obtaining predictions of the outcome under conditions that are different to those that are observed for a subset of individuals in the validation dataset. This work describes methods for evaluating counterfactual performance of predictions under interventions for time-to-event outcomes. This means we aim to assess how well predictions would match the validation data if all individuals had followed the treatment strategy under which predictions are made. We focus on counterfactual performance evaluation using longitudinal observational data, and under treatment strategies that involve sustaining a particular treatment regime over time. We introduce an estimation approach using artificial censoring and inverse probability weighting which involves creating a validation dataset that mimics the treatment strategy under which predictions are made. We extend measures of calibration, discrimination (c-index and cumulative/dynamic AUCt) and overall prediction error (Brier score) to allow assessment of counterfactual performance. The methods are evaluated using a simulation study, including scenarios in which the methods should detect poor performance. Applying our methods in the context of liver transplantation shows that our procedure allows quantification of the performance of predictions supporting crucial decisions on organ allocation.
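As a rough illustration of the artificial censoring and inverse probability weighting idea described above, the following Python sketch computes a weighted Brier score at a single prediction horizon. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `counterfactual_brier`, its arguments, and the toy data are all hypothetical, and the inverse probability weights are assumed to have been estimated beforehand (for example from a model for adherence to the treatment strategy) and evaluated at each person's event time or at the horizon, whichever comes first.

```python
import numpy as np


def counterfactual_brier(time, event, deviation_time, ip_weight, pred_risk, horizon):
    """Weighted Brier score at `horizon` under a sustained treatment strategy.

    All argument names are hypothetical. `deviation_time` is the time at which
    a person stops following the strategy (np.inf if they never deviate), and
    `ip_weight` is an already-estimated inverse probability weight.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    deviation_time = np.asarray(deviation_time, dtype=float)
    ip_weight = np.asarray(ip_weight, dtype=float)
    pred_risk = np.asarray(pred_risk, dtype=float)

    # Artificial censoring: follow-up ends when a person deviates from the strategy.
    censored_time = np.minimum(time, deviation_time)

    # Event observed by the horizon while still following the strategy.
    had_event = (event == 1) & (time <= deviation_time) & (time <= horizon)
    # Still event-free, uncensored, and following the strategy at the horizon.
    event_free = censored_time > horizon

    # People censored (artificially or otherwise) before the horizon get weight 0;
    # the weights of the remaining people stand in for them.
    usable = had_event | event_free
    weights = np.where(usable, ip_weight, 0.0)
    outcome = had_event.astype(float)

    return np.sum(weights * (outcome - pred_risk) ** 2) / np.sum(weights)


# Toy data (made up for illustration): four people, horizon of 4 time units.
score = counterfactual_brier(
    time=[2, 5, 8, 3],
    event=[1, 0, 1, 1],
    deviation_time=[np.inf, np.inf, 4, 1],
    ip_weight=[1.0, 2.0, 1.5, 1.2],
    pred_risk=[0.8, 0.1, 0.5, 0.6],
    horizon=4,
)
print(score)
```

In this toy example the last two people deviate from the strategy before their outcome status at the horizon is known, so they contribute nothing directly, and the weighted contributions of the first two people represent the population that would have been observed had everyone followed the strategy.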
