Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods

arXiv preprint, 17 pages, 2 figures. Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA): https://melba-journal.org/2025:035

Deep-learning-based segmentation algorithms have substantially advanced medical image analysis, particularly the delineation of anatomical structures in MRI. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate three deep-learning segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MR images. We use a dataset comprising four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test the segmentation models. The study consists of two parts: the first assesses the segmentation performance of the models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is measured quantitatively using a metric designed for segmentation performance. In addition, linear mixed models are used to analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. The results reveal that the sex effects observed with manual segmentation can also be observed with the biased models, whereas the race effects disappear in all but one model.
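As a rough illustration of the two-part evaluation described above, the sketch below scores segmentations against manual labels per demographic subgroup and derives a structure volume from a mask. It rests on several assumptions: the abstract does not name the accuracy metric or the fairness metric, so the Dice coefficient and the simple best-minus-worst gap used here are stand-ins, not the study's method. All function names and the toy masks are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def dice(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as perfect agreement by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def volume_mm3(mask, voxel_volume_mm3):
    """Structure volume: number of foreground voxels times the voxel volume."""
    return sum(1 for v in mask if v) * voxel_volume_mm3


def mean_dice_by_group(records):
    """records: iterable of (group_label, predicted_mask, manual_mask)."""
    scores = defaultdict(list)
    for group, pred, truth in records:
        scores[group].append(dice(pred, truth))
    return {group: mean(vals) for group, vals in scores.items()}


def largest_gap(group_means):
    """Assumed disparity summary: best subgroup mean minus worst subgroup mean."""
    return max(group_means.values()) - min(group_means.values())


# Toy, made-up masks for illustration only (not data from the study).
records = [
    ("black female", [1, 1, 0, 0], [1, 1, 0, 0]),
    ("white male", [1, 0, 0, 0], [1, 1, 0, 0]),
]
group_means = mean_dice_by_group(records)
gap = largest_gap(group_means)
```

A gap near zero would suggest similar accuracy across subgroups under this stand-in measure. Testing whether race, sex, or their interaction explain differences in accuracy or volume would additionally need a statistical model such as the linear mixed models mentioned in the abstract.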
