After a machine learning (ML)-based system is deployed in clinical practice, performance monitoring is needed to ensure that the algorithm remains safe and effective over time. The goal of this work is to highlight the complexity of designing a monitoring strategy and the need for a systematic framework for comparing the many monitoring options. One of the main decisions is whether to use real-world (observational) or interventional data. Although the former is the most convenient source of monitoring data, it exhibits well-known biases, such as confounding, selection bias, and missing data. In fact, when the ML algorithm interacts with its environment, the algorithm itself may be a primary source of bias. On the other hand, a carefully designed interventional study that randomizes individuals can explicitly eliminate such biases, but the ethics, feasibility, and cost of such an approach must be carefully considered. Beyond the choice of data source, monitoring strategies vary in the performance criteria they track, the interpretability of their test statistics, the strength of their assumptions, and their speed in detecting performance decay. As a first step towards developing a framework that compares the various monitoring options, we consider a case study of an ML-based risk prediction algorithm for postoperative nausea and vomiting (PONV). Bringing together tools from causal inference and statistical process control, we walk through the basic steps of defining candidate monitoring criteria, describing potential sources of bias and the causal model, and specifying and comparing candidate monitoring procedures. We hypothesize that these steps can be applied more generally, as causal inference can address other sources of bias as well.
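To make the statistical process control component concrete, the following is a minimal sketch of a one-sided CUSUM chart that tracks observed-minus-predicted residuals for a binary outcome such as PONV. It is an illustration only, not the procedure used in this work: the function names, the `reference` and `threshold` values, and the simulated data are all assumptions introduced here. The sketch also treats the data as unbiased, which is exactly what observational monitoring data may not be (for example, if clinicians treat patients flagged as high risk), so a causal adjustment would be needed before applying a chart like this in practice.

```python
import random


def cusum_alarm(residuals, reference=0.05, threshold=2.0):
    """One-sided CUSUM on residuals (observed outcome minus predicted risk).

    Returns the 0-based index of the first alarm, or None if no alarm fires.
    `reference` is the slack subtracted at each step; `threshold` is the
    alarm limit. Both values are illustrative assumptions, not tuned choices.
    """
    s = 0.0
    for t, r in enumerate(residuals):
        # Accumulate evidence of under-prediction; reset at zero.
        s = max(0.0, s + r - reference)
        if s > threshold:
            return t
    return None


def simulate_residuals(n_stable, n_drift, risk=0.3, shift=0.2, seed=0):
    """Hypothetical data: the model always predicts `risk`, while the true
    event rate rises by `shift` after the first `n_stable` patients."""
    rng = random.Random(seed)
    residuals = []
    for t in range(n_stable + n_drift):
        true_rate = risk if t < n_stable else risk + shift
        outcome = 1 if rng.random() < true_rate else 0
        residuals.append(outcome - risk)
    return residuals


if __name__ == "__main__":
    res = simulate_residuals(n_stable=200, n_drift=200)
    print("first alarm at patient index:", cusum_alarm(res))
```

A larger `threshold` lowers the false-alarm rate at the cost of slower detection, which is one instance of the trade-off between assumptions, interpretability, and detection speed described above.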