FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation. However, these models also raise significant fairness concerns, especially when applied to diverse and underrepresented patient populations. There is currently a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries for evaluating and understanding the fairness of FMs in medical imaging, which makes it difficult to formulate and implement solutions that ensure equitable outcomes across patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates 17 popular medical imaging datasets spanning different modalities, dimensionalities, and sensitive attributes. It covers 20 widely used FMs under several usage modes, including zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting, on two downstream tasks: classification and segmentation. Our extensive analysis evaluates fairness with multiple metrics and from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs across FMs, consistent disparities on the same dataset regardless of the FM used, and limited effectiveness of existing unfairness mitigation methods. FairMedFM's project page and open-source codebase support extensible functionality and applications, and are intended to support inclusive, long-term research on FMs in medical imaging.
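
To make the notion of a group-level disparity concrete, the following is a minimal sketch of one common way to quantify it: compute accuracy separately for each value of a sensitive attribute and report the gap between the best and worst group. This is an illustration only and is not taken from the FairMedFM codebase; the function names `per_group_accuracy` and `accuracy_gap`, the toy labels, and the choice of accuracy as the utility measure are all assumptions made for this example.

```python
from collections import defaultdict


def per_group_accuracy(y_true, y_pred, groups):
    """Return a dict mapping each sensitive-attribute value to its accuracy.

    Hypothetical helper for illustration; not part of any FairMedFM API.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(y_true, y_pred, groups):
    """Return the difference between the best and worst per-group accuracy."""
    acc = per_group_accuracy(y_true, y_pred, groups)
    return max(acc.values()) - min(acc.values())


# Toy data (made up): binary labels with a two-valued sensitive attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

print(per_group_accuracy(y_true, y_pred, groups))
print(accuracy_gap(y_true, y_pred, groups))
```

A gap of zero would mean every group is served equally well under this particular measure. In practice one would likely report such a gap alongside overall utility, since a model can shrink the gap simply by performing worse on every group.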
