Machine learning models have achieved high overall accuracy in medical image analysis. However, performance disparities on specific patient groups pose challenges to their clinical utility, safety, and fairness. These disparities can affect known patient groups (such as those defined by sex, age, or disease subtype) as well as previously unknown and unlabeled groups. Furthermore, the root cause of an observed performance disparity is often difficult to uncover, which hinders mitigation efforts. To address these issues, we leverage Slice Discovery Methods (SDMs) to identify interpretable underperforming subsets of data and to formulate hypotheses about the cause of observed performance disparities. We introduce a novel SDM and apply it in a case study on the classification of pneumothorax and atelectasis from chest X-rays. Our study demonstrates the effectiveness of SDMs in hypothesis formulation and yields an explanation for previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models. Our findings indicate shortcut learning in both classification tasks: the models rely on the presence of chest drains for pneumothorax and of ECG wires for atelectasis. Sex-based differences in the prevalence of these shortcut features appear to cause the observed classification performance gap, representing a previously underappreciated interaction between shortcut learning and model fairness analyses.