Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

Healthcare disparities persist across socioeconomic boundaries and are often attributed to unequal access to screening, diagnostics, and therapeutics. This perspective highlights, however, that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, in cases where studies and the data they collect focus on the molecular level. Many studies collect omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it shows substantial biases. An automated analysis of 4719 PubMed-indexed omics publications from 2015 to 2024 reveals that only a small fraction report ancestry or ethnicity information, with ancestry reporting improving slightly. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias, with European-ancestry data dominating. As biomedical foundation models become central to biomedical discovery, under a paradigm in which base models are pretrained on large datasets and then reused repeatedly for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles, Provenance, Openness, and Evaluation Transparency, to improve equity and robustness in biomedical AI. This approach aims to foster biomedical innovation that more effectively serves underserved populations and improves health outcomes.
