Confidence intervals uncovered: Are we ready for real-world medical imaging AI?

Evangelia Christodoulou, Annika Reinke, Rola Houhou, Piotr Kalinowski, Selen Erkan, Carole H. Sudre, Ninon Burgos, Sofiène Boutaj, Sophie Loizillon, Maëlys Solal, Nicola Rieke, Veronika Cheplygina, Michela Antonelli, Leon D. Mayer, Minu D. Tizabi, M. Jorge Cardoso, Amber Simpson, Paul F. Jäger, Annette Kopp-Schneider, Gaël Varoquaux, Olivier Colliot, Lena Maier-Hein

From arXiv. Paper accepted at the MICCAI 2024 conference.

Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determining which methods should be translated into clinical practice. Frequently, broad conclusions are derived from mean performance values alone. In this paper, we argue that this common practice is often a misleading simplification because it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one paper (0.5%) reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03, which is three times larger than the median performance gap between the first- and second-ranked methods. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.
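
To make contributions (2) and (3) concrete, the following minimal Python sketch shows one way a 95% CI could be reconstructed from a reported mean DSC and test-set size. It is an illustration under stated assumptions, not the authors' implementation: the polynomial coefficients are hypothetical placeholders (the abstract does not give the fitted values), the interval uses a normal approximation (mean ± 1.96 · SD / √n), clipping to the DSC range [0, 1] is a choice made here, and the function names `approx_sd`, `reconstruct_ci`, and `second_within_ci` are invented for this example.

```python
import math


def approx_sd(mean_dsc, coeffs):
    """Approximate the SD as a second-order polynomial of the mean DSC.

    coeffs = (a, b, c) for a * m**2 + b * m + c. The values must come from
    a fit to real data; none are provided in the abstract.
    """
    a, b, c = coeffs
    return a * mean_dsc ** 2 + b * mean_dsc + c


def reconstruct_ci(mean_dsc, n_test, coeffs, z=1.96):
    """Reconstruct an approximate 95% CI around a reported mean DSC.

    Assumes a normal approximation: mean +/- z * SD / sqrt(n_test).
    The SD is floored at 0 and the bounds are clipped to [0, 1].
    """
    sd = max(approx_sd(mean_dsc, coeffs), 0.0)
    half_width = z * sd / math.sqrt(n_test)
    lower = max(mean_dsc - half_width, 0.0)
    upper = min(mean_dsc + half_width, 1.0)
    return lower, upper


def second_within_ci(mean_first, mean_second, n_test, coeffs):
    """Check whether the runner-up's mean lies inside the winner's CI."""
    lower, upper = reconstruct_ci(mean_first, n_test, coeffs)
    return lower <= mean_second <= upper


# HYPOTHETICAL coefficients, chosen only so the example produces a
# plausible SD; they are not the values fitted in the paper.
PLACEHOLDER_COEFFS = (-0.5, 0.3, 0.25)

if __name__ == "__main__":
    lower, upper = reconstruct_ci(0.85, n_test=100, coeffs=PLACEHOLDER_COEFFS)
    print(f"Reconstructed 95% CI: [{lower:.3f}, {upper:.3f}]")
    print("Runner-up (0.84) inside CI:",
          second_within_ci(0.85, 0.84, 100, PLACEHOLDER_COEFFS))
```

With real fitted coefficients in place of the placeholders, the same check could be run over each paper's first- and second-ranked methods to see how often the reported performance gap falls inside the reconstructed interval.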
