Performance evaluation of predictive AI models to support medical decisions: Overview and guidance

Ben Van Calster, Gary S. Collins, Andrew J. Vickers, Laure Wynants, Kathleen F. Kerr, Lasai Barreñada, Gael Varoquaux, Karandeep Singh, Karel G. M. Moons, Tina Hernandez-Boussard, Dirk Timmerman, David J. McLernon, Maarten van Smeden, Ewout W. Steyerberg

From arXiv: 60 pages, 8 tables, 11 figures, two supplementary appendices

A myriad of measures to quantify the performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed for use in medical practice, because poorly performing models may harm patients and increase costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility), along with accompanying graphical assessments. The first four domains cover statistical performance; the fifth covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure's expected value is optimized when it is calculated using the correct probabilities (i.e., whether it is a "proper" measure), and (2) whether the measure reflects either purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibit both characteristics, fourteen exhibit one characteristic, and one (the F1 measure) exhibits neither. All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, a calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.
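As a rough illustration of two of the recommended measures, the sketch below computes AUROC and net benefit on a small made-up dataset. It is not taken from the paper: the function names (`auroc`, `net_benefit`, `net_benefit_treat_all`), the toy outcomes `y`, and the toy predicted probabilities `p` are all hypothetical. Net benefit is written in its commonly used form, TP/n − (FP/n) × t/(1 − t) for a decision threshold t, which is assumed here rather than quoted from the abstract.

```python
import numpy as np


def auroc(y, p):
    """AUROC as the share of (event, non-event) pairs ranked correctly.

    Ties in predicted probability count as half a correct pair.
    """
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    pos = p[y == 1]  # predictions for patients with the event
    neg = p[y == 0]  # predictions for patients without the event
    diff = pos[:, None] - neg[None, :]  # all pairwise differences
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def net_benefit(y, p, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    n = len(y)
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))  # true positives
    fp = np.sum(treat & (y == 0))  # false positives
    # The odds of the threshold weight the harm of a false positive.
    return float(tp / n - fp / n * threshold / (1 - threshold))


def net_benefit_treat_all(y, threshold):
    """Net benefit of the default strategy of treating everyone."""
    prevalence = float(np.mean(y))
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical validation data: observed outcomes and predicted risks.
y = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
p = [0.05, 0.10, 0.25, 0.30, 0.35, 0.40, 0.60, 0.70, 0.15, 0.90]

print("AUROC:", auroc(y, p))
print("Net benefit at t=0.2 (model):", net_benefit(y, p, 0.2))
print("Net benefit at t=0.2 (treat all):", net_benefit_treat_all(y, 0.2))
```

Repeating the net benefit calculation over a range of clinically reasonable thresholds, and plotting the model against the treat-all and treat-none (net benefit 0) strategies, gives a basic decision curve.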
