In critical decision support systems based on medical imaging, the reliability of AI-assisted decisions matters as much as predictive accuracy. Although deep learning models achieve high accuracy, they are frequently miscalibrated, most visibly as overconfidence in erroneous predictions. For clinical acceptance, a model must quantify uncertainty in a way that correlates with prediction correctness, so that clinicians can flag unreliable outputs for further review. To address this need, this paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that penalizes high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to make the resulting uncertainty estimates more interpretable. The proposed framework is validated on three distinct medical imaging tasks: automatic pneumonia screening, diabetic retinopathy detection, and skin lesion identification. Empirical results show that the approach improves calibration consistently across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real-world clinical deployment.
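The two ideas named above, penalizing misaligned certainty and rescaling logits by a temperature, can be illustrated with a small sketch. This is not the paper's CUB-Loss or DTS formulation, neither of which is specified in the abstract. It assumes normalized predictive entropy as the uncertainty measure, a simple linear penalty, and ordinary single-temperature scaling. The function names `softmax`, `normalized_entropy`, and `alignment_penalty` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over one list of logits, after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Predictive entropy scaled to [0, 1]: 0 is certain, 1 is uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def alignment_penalty(batch_probs, labels):
    """Assumed illustrative penalty, not the paper's loss.

    A correct prediction costs its uncertainty u (low certainty is penalized);
    a wrong prediction costs 1 - u (high certainty is penalized).
    """
    total = 0.0
    for probs, label in zip(batch_probs, labels):
        u = normalized_entropy(probs)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        total += u if predicted == label else 1.0 - u
    return total / len(labels)
```

In this sketch, a temperature above 1 flattens the softmax output and so raises the entropy-based uncertainty, while a temperature below 1 sharpens it. How the "dual" variant combines two temperatures is left open here because the abstract does not say.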