Clinical Melanoma Diagnosis with Artificial Intelligence: Insights from a Prospective Multicenter Study

Lukas Heinlein, Roman C. Maron, Achim Hekler, Sarah Haggenmüller, Christoph Wies, Jochen S. Utikal, Friedegund Meier, Sarah Hobelsberger, Frank F. Gellrich, Mildred Sergon, Axel Hauschild, Lars E. French, Lucie Heinzerling, Justin G. Schlager, Kamran Ghoreschi, Max Schlaak, Franz J. Hilke, Gabriela Poch, Sören Korsing, Carola Berking, Markus V. Heppt, Michael Erdmann, Sebastian Haferkamp, Konstantin Drexler, Dirk Schadendorf, Wiebke Sondermann, Matthias Goebeler, Bastian Schilling, Eva Krieghoff-Henning, Titus J. Brinker

Early detection of melanoma, a potentially lethal type of skin cancer with high prevalence worldwide, improves patient prognosis. In retrospective studies, artificial intelligence (AI) has been shown to improve melanoma detection. However, few prospective studies have confirmed these promising results. Existing studies are limited by small sample sizes, overly homogeneous datasets, or the omission of rare melanoma subtypes, which prevents a fair and thorough evaluation of AI and of its generalizability, a crucial aspect for its application in the clinical setting. Therefore, we assessed 'All Data are Ext' (ADAE), an established open-source ensemble algorithm for detecting melanomas, by comparing its diagnostic accuracy to that of dermatologists on a prospectively collected, external, heterogeneous test set spanning eight distinct hospitals, four different camera setups, rare melanoma subtypes, and special anatomical sites. We extended the algorithm with real test-time augmentation (R-TTA, i.e., providing real photographs of each lesion taken from multiple angles and averaging the predictions) and evaluated its generalization capabilities. Overall, the AI achieved a higher balanced accuracy than dermatologists (0.798, 95% confidence interval (CI) 0.779-0.814 vs. 0.781, 95% CI 0.760-0.802; p<0.001), obtaining a higher sensitivity (0.921, 95% CI 0.900-0.942 vs. 0.734, 95% CI 0.701-0.770; p<0.001) at the cost of a lower specificity (0.673, 95% CI 0.641-0.702 vs. 0.828, 95% CI 0.804-0.852; p<0.001). Because the algorithm exhibited a significant performance advantage on our heterogeneous dataset, which exclusively comprises melanoma-suspicious lesions, AI may be able to support dermatologists, particularly in diagnosing challenging cases.
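
As a minimal illustration of two quantities named in the abstract, the sketch below averages per-photograph predictions for one lesion (the idea behind R-TTA) and computes balanced accuracy as the mean of sensitivity and specificity. It is not the ADAE implementation: the function names, the example probabilities, and the 0.5 decision threshold are assumptions made for illustration only. The sensitivity and specificity values are the ones reported above.

```python
from statistics import mean


def r_tta_score(view_probabilities):
    """Average the melanoma probabilities predicted for several real
    photographs of the same lesion (hypothetical helper, not ADAE code)."""
    if not view_probabilities:
        raise ValueError("need at least one photograph per lesion")
    return mean(view_probabilities)


def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Hypothetical per-photograph probabilities for one lesion (made-up values).
views = [0.62, 0.48, 0.71]
lesion_score = r_tta_score(views)

# Assumed decision threshold; the abstract does not state one.
THRESHOLD = 0.5
flagged_as_melanoma = lesion_score >= THRESHOLD

# Sensitivity and specificity as reported in the abstract.
ai_balanced_accuracy = balanced_accuracy(0.921, 0.673)
dermatologist_balanced_accuracy = balanced_accuracy(0.734, 0.828)

print(f"R-TTA lesion score: {lesion_score:.3f}")
print(f"AI balanced accuracy: {ai_balanced_accuracy:.3f}")
print(f"Dermatologist balanced accuracy: {dermatologist_balanced_accuracy:.3f}")
```

Recomputed from the rounded sensitivity and specificity, the balanced accuracies come out at roughly 0.797 for the AI and 0.781 for the dermatologists, in line with the reported 0.798 and 0.781. The small difference for the AI presumably reflects rounding of the published figures.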
