Quantifying the uncertainty of predictions has been identified as one way to develop more trustworthy artificial intelligence (AI) models, beyond the conventional reporting of performance metrics. In a clinical decision support setting, AI classification models should ideally avoid confident wrong predictions and maximise the confidence of correct predictions. Models that do this are said to be well-calibrated with regard to confidence. However, relatively little attention has been paid to how to improve calibration when training these models, i.e., how to make the training strategy uncertainty-aware. In this work, we evaluate three novel uncertainty-aware training strategies and compare them with two state-of-the-art approaches. We analyse performance on two different clinical applications: cardiac resynchronisation therapy (CRT) response prediction and coronary artery disease (CAD) diagnosis from cardiac magnetic resonance (CMR) images. The best-performing model in terms of both classification accuracy and the most common calibration measure, expected calibration error (ECE), was the Confidence Weight method, a novel approach that weights the loss of samples to explicitly penalise confident incorrect predictions. Compared to a baseline classifier with no uncertainty-aware strategy, the method reduced the ECE by 17% for CRT response prediction and by 22% for CAD diagnosis. In both applications, the reduction in ECE was accompanied by a slight increase in accuracy: from 69% to 70% for CRT response prediction and from 70% to 72% for CAD diagnosis. However, our analysis showed that the optimal model was not consistent across different calibration measures. This indicates the need for careful consideration of performance metrics when training and selecting models for complex high-risk applications in healthcare.
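For readers unfamiliar with ECE, the following is a minimal sketch of how the measure is commonly computed. It is not taken from this work. It assumes the standard binned definition: predictions are grouped into equal-width confidence bins, and ECE is the sample-weighted average of the gap between accuracy and mean confidence in each bin. The function name, the default of 10 bins, and the half-open bin edges are all assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: confidence of the predicted class for each sample, in [0, 1]
    correct:     booleans, True where the prediction matched the label
    n_bins:      number of equal-width bins (10 is an assumed default)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each sample to an equal-width bin [k/n_bins, (k+1)/n_bins);
    # a confidence of exactly 1.0 is placed in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Illustrative toy values only, not data from the study.
    confs = [0.9, 0.9, 0.6, 0.6]
    hits = [True, False, True, True]
    print(f"ECE = {expected_calibration_error(confs, hits):.3f}")
```

Under this definition, a confident wrong prediction widens the gap in a high-confidence bin and so raises the ECE, which is the behaviour the uncertainty-aware training strategies aim to discourage. Because the result depends on the binning scheme, other calibration measures can rank models differently.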