Domain gaps are among the most relevant roadblocks in the clinical translation of machine learning (ML)-based solutions for medical image analysis. While current research focuses on new training paradigms and network architectures, little attention is given to the specific effect of prevalence shifts on an algorithm deployed in practice. Such discrepancies between the class frequencies in the data used for a method's development/validation and those in its deployment environment(s) are of great importance, for example in the context of artificial intelligence (AI) democratization, as disease prevalences may vary widely across time and location. Our contribution is twofold. First, we empirically demonstrate the potentially severe consequences of failing to handle prevalence shifts by analyzing (i) the extent of miscalibration, (ii) the deviation of the decision threshold from the optimum, and (iii) the ability of validation metrics to reflect neural network performance on the deployment population as a function of the discrepancy between development and deployment prevalence. Second, we propose a workflow for prevalence-aware image classification that uses estimated deployment prevalences to adjust a trained classifier to a new environment, without requiring additional annotated deployment data. Comprehensive experiments based on a diverse set of 30 medical classification tasks showcase the benefit of the proposed workflow in generating better classifier decisions and more reliable performance estimates compared to current practice.
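To make the idea of adjusting a trained classifier with an estimated deployment prevalence concrete, the following is a minimal sketch of a standard prior-shift correction: each class probability is re-weighted by the ratio of deployment to development prevalence and then renormalized. This is a common textbook technique and not necessarily the workflow proposed here. The function name `adjust_for_prevalence` and the example numbers are hypothetical, and the sketch assumes the classifier outputs are already calibrated on the development data.

```python
def adjust_for_prevalence(probs, dev_prevalence, deploy_prevalence):
    """Re-weight class probabilities for a new class-prevalence vector.

    probs: per-class probabilities for one sample (assumed to sum to 1 and
           to be calibrated with respect to the development data).
    dev_prevalence: class frequencies in the development data (all > 0).
    deploy_prevalence: estimated class frequencies at deployment.
    """
    # Multiply each probability by the prevalence ratio new / old.
    weighted = [
        p * new / old
        for p, old, new in zip(probs, dev_prevalence, deploy_prevalence)
    ]
    # Renormalize so the adjusted probabilities sum to 1 again.
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical example: the model was developed on balanced data, but the
# disease (class 1) is estimated to occur in only 10% of deployment cases.
adjusted = adjust_for_prevalence([0.5, 0.5], [0.5, 0.5], [0.9, 0.1])
print(adjusted)
```

In this example, a prediction that looked undecided under balanced development data shifts toward the healthy class once the lower deployment prevalence is taken into account. No annotated deployment data is needed, only a prevalence estimate.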