Trustworthy deployment of deep learning medical imaging models in real-world clinical practice requires that they be calibrated. However, a model that is well calibrated overall can still be poorly calibrated for a sub-population, so a clinician may unwittingly make poor decisions for this group based on the model's recommendations. Although existing methods have been shown to mitigate biases across subgroups in terms of model accuracy, this work focuses on the open problem of mitigating calibration biases in the context of medical image analysis. Our method does not require subgroup attributes during training, which gives the flexibility to mitigate biases for different choices of sensitive attributes without re-training. To this end, we propose a novel two-stage method, Cluster-Focal, which first identifies poorly calibrated samples and clusters them into groups, and then applies a group-wise focal loss to reduce calibration bias. We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients. In addition to traditional sensitive attributes that define demographic subgroups (e.g. age, sex), we also consider biases among groups defined by image-derived attributes, such as lesion load, which are required in medical image analysis. Our results demonstrate that our method effectively controls calibration error in the worst-performing subgroups while preserving prediction performance, and that it outperforms recent baselines.
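To make the notion of calibration error in the worst-performing subgroup concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes the standard binned expected calibration error (ECE) as the calibration metric, which the abstract does not specify, and the function names `expected_calibration_error` and `worst_group_ece` are hypothetical, as is the choice of 10 equal-width bins.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean |accuracy - confidence| over bins.

    Assumption: equal-width confidence bins on [0, 1]; the paper's exact
    metric and binning are not stated in the abstract.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a bin (lo, hi]; a confidence of 0 lands in bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


def worst_group_ece(confidences, correct, groups, n_bins=10):
    """Return the subgroup label with the highest ECE and all per-group ECEs."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    groups = np.asarray(groups)
    per_group = {
        g: expected_calibration_error(
            confidences[groups == g], correct[groups == g], n_bins
        )
        for g in np.unique(groups)
    }
    worst = max(per_group, key=per_group.get)
    return worst, per_group
```

Because the subgroup labels are only needed at evaluation time in this sketch, the same predictions can be re-scored against different attributes (for example age, sex, or lesion load) by changing the `groups` argument, without touching the model.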