Prediction under hypothetical interventions: evaluation of counterfactual performance using longitudinal observational data

Predictions under hypothetical interventions are estimates of what a person's risk of an outcome would be if they were to follow a particular treatment strategy, given their individual characteristics. Such predictions can provide important input to medical decision making. However, evaluating the predictive performance of interventional predictions is challenging. Standard methods for evaluating predictive performance do not apply to observational data, because the predictions concern outcomes under a treatment strategy that a subset of individuals in the validation dataset did not actually follow. This work describes methods for evaluating the counterfactual performance of predictions under interventions for time-to-event outcomes. That is, we aim to assess how well predictions would match the validation data if all individuals had followed the treatment strategy under which the predictions are made. We focus on counterfactual performance evaluation using longitudinal observational data, under treatment strategies that involve sustaining a particular treatment regime over time. We introduce an estimation approach based on artificial censoring and inverse probability weighting, which creates a validation dataset that mimics the treatment strategy under which predictions are made. We extend measures of calibration, discrimination (c-index and cumulative/dynamic AUC) and overall prediction error (Brier score) to allow assessment of counterfactual performance. The methods are evaluated in a simulation study, including scenarios in which they should detect poor performance. An application in the context of liver transplantation shows that our procedure allows the performance of predictions that support crucial decisions on organ allocation to be quantified.
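The idea of artificial censoring combined with inverse probability weighting can be illustrated with a minimal sketch of a weighted Brier score. This is not the paper's estimator; it is a simplified illustration under several assumptions that do not come from the abstract: time is split into discrete visit intervals, the strategy of interest is "never treated", there is no censoring other than deviation from that strategy, and the probabilities of adhering to the strategy in each interval (which in practice would come from a fitted treatment model) are supplied directly. The function name counterfactual_brier and all argument names are hypothetical.

```python
import numpy as np


def counterfactual_brier(treat, p_adhere, event_time, pred_risk, horizon):
    """Weighted Brier score at `horizon` under a "never treated" strategy.

    treat      : (n, K) 0/1 treatment indicator per visit interval
    p_adhere   : (n, K) assumed probability of remaining untreated in each
                 interval given the past (hypothetical input; in practice
                 estimated from a treatment model)
    event_time : (n,) index of the interval in which the event occurs,
                 or any value >= horizon if no event occurs before the horizon
    pred_risk  : (n,) predicted risk of the event before `horizon`
                 under the "never treated" strategy
    horizon    : number of intervals over which risk is predicted
    """
    treat = np.asarray(treat)
    p_adhere = np.asarray(p_adhere, dtype=float)
    event_time = np.asarray(event_time)
    pred_risk = np.asarray(pred_risk, dtype=float)

    # Intervals during which each person must follow the strategy:
    # up to their event interval, or up to the last interval before the horizon.
    k = np.arange(horizon)
    last = np.minimum(event_time, horizon - 1)
    needed = k[None, :] <= last[:, None]

    # Artificial censoring: anyone who starts treatment in a needed interval
    # is censored and contributes zero weight.
    deviated = (treat[:, :horizon] == 1) & needed
    uncensored = ~deviated.any(axis=1)

    # Inverse probability weight: one over the cumulative probability of
    # having adhered throughout the needed intervals.
    probs = np.where(needed, p_adhere[:, :horizon], 1.0)
    weights = uncensored.astype(float) / probs.prod(axis=1)

    # Observed outcome status at the horizon, then the weighted mean
    # squared difference between outcome and predicted risk.
    outcome = (event_time < horizon).astype(float)
    return np.sum(weights * (outcome - pred_risk) ** 2) / np.sum(weights)


# Toy data: 4 people, 2 intervals; event_time == 2 means no event by the horizon.
treat = [[0, 0], [0, 1], [0, 0], [1, 0]]
p_adhere = [[0.8, 0.5]] * 4
event_time = [0, 2, 2, 1]
pred_risk = [0.5, 0.2, 0.2, 0.9]
score = counterfactual_brier(treat, p_adhere, event_time, pred_risk, horizon=2)
```

In this toy example the second and fourth individuals start treatment and are artificially censored, so only the first and third contribute, each weighted up to stand in for comparable individuals who deviated from the strategy.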
