FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about fairness, especially when applied to diverse and underrepresented patient populations. There is currently a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries for evaluating and understanding the fairness of FMs in medical imaging, which makes it difficult to formulate and implement solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs under various usages, such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting, in the downstream tasks of classification and segmentation. Our exhaustive analysis evaluates fairness with different evaluation metrics from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs across FMs, consistent disparities on the same datasets regardless of the FM used, and limited effectiveness of existing unfairness mitigation methods. Check out FairMedFM's project page and open-source codebase, which supports extensible functionalities and applications and is intended to support inclusive studies of FMs in medical imaging over the long term.
