Grading precancerous lesions on whole slide images is a challenging task: the continuous space of morphological phenotypes often makes clear-cut decisions between grades difficult, leading to low inter- and intra-rater agreement. A growing number of Artificial Intelligence (AI) algorithms are being developed to help pathologists perform and standardize their diagnoses. However, these models can render predictions without accounting for the ambiguity of the classes and can fail without notice, which prevents their wider acceptance in a clinical context. In this paper, we propose a new score to measure the confidence of AI models in grading tasks. Our confidence score is specifically adapted to ordinal output variables, is versatile, and requires no extra training, additional inferences, or particular architecture changes. Comparison with other popular techniques such as Monte Carlo Dropout and deep ensembles shows that our method provides state-of-the-art results while being simpler, more versatile, and less computationally intensive. The score is also easily interpretable and consistent with the real-life hesitations of pathologists. We show that the score is capable of accurately identifying mispredicted slides and that accuracy for high-confidence decisions is significantly higher than for low-confidence decisions (gap in AUC of 17.1% on the test set). We believe that the proposed confidence score could be leveraged by pathologists directly in their workflow and assist them with difficult tasks such as grading precancerous lesions.