Domain gaps are among the most relevant roadblocks in the clinical translation of machine learning (ML)-based solutions for medical image analysis. While current research focuses on new training paradigms and network architectures, little attention is given to the specific effect of prevalence shifts on an algorithm deployed in practice. Such discrepancies between the class frequencies in the data used for a method's development/validation and those in its deployment environment(s) are of great importance, for example in the context of artificial intelligence (AI) democratization, as disease prevalences may vary widely across time and location. Our contribution is twofold. First, we empirically demonstrate the potentially severe consequences of missing prevalence handling by analyzing, as a function of the discrepancy between development and deployment prevalence, (i) the extent of miscalibration, (ii) the deviation of the decision threshold from the optimum, and (iii) the ability of validation metrics to reflect neural network performance on the deployment population. Second, we propose a workflow for prevalence-aware image classification that uses estimated deployment prevalences to adjust a trained classifier to a new environment, without requiring additional annotated deployment data. Comprehensive experiments on a diverse set of 30 medical classification tasks showcase the benefit of the proposed workflow in generating better classifier decisions and more reliable performance estimates compared to current practice.
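To make the idea of adjusting a trained classifier to a new prevalence concrete, the following is a minimal sketch of standard prior-shift re-weighting via Bayes' rule. It is not the workflow proposed in the paper, whose details the abstract does not give. The function name `adjust_for_prevalence`, the example probabilities, and the prevalence values are all hypothetical, and the sketch assumes the classifier outputs are already calibrated on the development data and that every development prevalence is greater than zero.

```python
def adjust_for_prevalence(probs, dev_prevalence, deploy_prevalence):
    """Re-weight class posteriors from development to deployment prevalences.

    Illustrative prior-shift correction (an assumption, not the paper's method):
    each posterior is scaled by deploy_prevalence / dev_prevalence for its class,
    then the scaled values are renormalized to sum to 1.
    Assumes every entry of dev_prevalence is > 0.
    """
    weighted = [
        p * (q / r)
        for p, q, r in zip(probs, deploy_prevalence, dev_prevalence)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical example: balanced development set, 5% disease prevalence at deployment.
probs = [0.4, 0.6]  # classifier output for [healthy, diseased]
adjusted = adjust_for_prevalence(probs, [0.5, 0.5], [0.95, 0.05])

# Predicted class before and after the adjustment (index of the largest probability).
pred_before = max(range(len(probs)), key=probs.__getitem__)
pred_after = max(range(len(adjusted)), key=adjusted.__getitem__)
```

In this hypothetical case the unadjusted output favors the diseased class, while the re-weighted output favors the healthy class. This illustrates how a decision rule tuned at one prevalence can be suboptimal at another.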