High-quality biomedical datasets are essential for medical research and disease treatment innovation. The NIH-funded Bridge2AI project aims to facilitate such innovation by bringing together diverse, top-tier teams to curate datasets designed for AI-driven biomedical research. We examined 1,699 dataset papers from the Nucleic Acids Research (NAR) database issues and the Bridge2AI Talent Knowledge Graph. Treating each paper's authors as a team, we explored the relationship between team attributes (team power and fairness) and dataset paper quality, measured by scientific impact (Relative Citation Ratio percentile) and clinical translation power (Approximate Potential to Translate, APT: the likelihood of citation by clinical trials and guidelines). Using the SHAP explainable AI framework, we identified correlations between team attributes and the success of dataset papers in both citation impact and clinical translation. Key findings are that (1) Principal Investigator (PI) leadership and team academic prowess are strong predictors of dataset success; (2) team size and career age are positively correlated with scientific impact but show inverse patterns for clinical translation; and (3) higher female representation correlates with greater dataset success. Although our results are correlational, they offer valuable insights into forming high-performing data generation teams. Future research should incorporate causal frameworks to deepen understanding of these relationships.