An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation

Segmentation foundation models such as the Segment Anything Model (SAM) have attracted increasing interest in the medical imaging community. Early studies concentrated on assessing and improving SAM's overall accuracy and efficiency, but paid little attention to fairness. This gap raises the question of whether foundation models exhibit performance biases similar to those found in task-specific deep learning models such as nnU-Net. In this paper, we examine the fairness of large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans covering the liver, kidney, spleen, lung, and aorta from a total of 1056 healthy subjects, with expert segmentations. Crucially, we record demographic details such as gender, age, and body mass index (BMI) for each subject to support a detailed fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM, and SAT models, to evaluate segmentation performance across demographic groups and identify disparities. Our analysis, which accounts for various confounding factors, reveals significant fairness concerns in these foundation models. Moreover, we find disparities not only in overall segmentation metrics such as the Dice Similarity Coefficient, but also in the spatial distribution of segmentation errors, providing empirical evidence of the challenges in ensuring fairness in medical image segmentation.
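To make the group-wise comparison concrete, the following is a minimal sketch of how a Dice Similarity Coefficient could be computed per subject and then summarized per demographic group. It is illustrative only and is not the study's pipeline: the function names `dice`, `group_means`, and `fairness_gap` are hypothetical, the max-minus-min gap is just one simple disparity summary, and, unlike the analysis described above, this sketch makes no adjustment for confounding factors.

```python
import numpy as np


def dice(pred, truth):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


def group_means(scores, groups):
    """Mean score for each demographic group label (e.g. gender or BMI band)."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    return {str(g): float(scores[groups == g].mean()) for g in np.unique(groups)}


def fairness_gap(scores, groups):
    """Difference between the best and worst group mean (unadjusted)."""
    means = group_means(scores, groups)
    return max(means.values()) - min(means.values())
```

For example, `fairness_gap(per_subject_dice, gender_labels)` would return the unadjusted difference in mean Dice between the best- and worst-served gender groups, where both arguments are hypothetical arrays with one entry per subject.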
