Healthcare disparities persist across socioeconomic boundaries and are often attributed to unequal access to screening, diagnostics, and therapeutics. This perspective argues, however, that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, particularly in studies focused on molecular and omics data. Many studies collect omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it reveals substantial biases. An automated analysis of 4514 PubMed-indexed omics publications from 2015 to 2024, examining reporting across multiple demographic dimensions, reveals limited reporting overall; for example, only 2.7% of studies report ancestry or ethnicity information, and only 2.5% report geographic origin. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias, with European-ancestry data dominating. Biomedical foundation models are becoming central to biomedical discovery, under a paradigm in which base models are pretrained on large datasets and then reused repeatedly for many different downstream tasks. As a result, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles: Provenance, Openness, and Reliability through Evaluation Transparency. Together, these principles can help make biases and limitations more visible to model developers and users, supporting more informed model development, evaluation, and deployment decisions in biomedical AI.