The impact of electronic health records (EHR) data continuity on prediction model fairness and racial-ethnic disparities

From arXiv. Substantial revision planned: we are preparing a significantly revised version and prefer to withdraw the current one rather than leave a misleading draft up.

Electronic health records (EHR) data vary considerably in completeness across sites and patients. A lack of "EHR data-continuity", or "EHR data-discontinuity", defined as "having medical information recorded outside the reach of an EHR system", can lead to substantial information bias. The objective of this study was to comprehensively evaluate (1) how EHR data-discontinuity introduces data bias, (2) how case-finding algorithms affect downstream prediction models, and (3) how algorithmic fairness is associated with racial-ethnic disparities. We used EHRs linked with Medicaid and Medicare claims data in the OneFlorida+ network and applied a validated measure, Mean Proportions of Encounters Captured (MPEC), to estimate patients' EHR data continuity. We developed a machine learning model for predicting type 2 diabetes (T2D) diagnosis as the use case for this work. We found that selecting cohorts by different levels of EHR data continuity affects model utility in disease prediction tasks. Prediction models trained on high-continuity data fit worse on low-continuity data. We also found that racial and ethnic disparities in model performance and model fairness vary across models developed using different degrees of data continuity. Our results suggest that careful evaluation of data continuity is critical to improving both the validity of real-world evidence generated from EHR data and health equity.
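To make the continuity measure concrete, here is a minimal sketch of how a per-patient MPEC-style score could be computed. It assumes, as the abstract does not spell out, that linked claims serve as the reference count of all encounters, that the proportion captured is computed separately per encounter type and then averaged, and that proportions are capped at 1. The function name `mpec` and the encounter-type keys are hypothetical and chosen only for illustration.

```python
from statistics import mean


def mpec(ehr_counts, claims_counts):
    """Illustrative mean proportion of encounters captured for one patient.

    ehr_counts:    dict mapping encounter type -> encounters seen in the EHR
    claims_counts: dict mapping encounter type -> encounters seen in claims
                   (assumed here to be the complete reference count)
    Returns None when no encounter type has a non-zero claims count.
    """
    proportions = []
    for enc_type, total in claims_counts.items():
        if total == 0:
            # Proportion is undefined without any reference encounters.
            continue
        # Assumption: cap at the claims total so the proportion stays in [0, 1].
        captured = min(ehr_counts.get(enc_type, 0), total)
        proportions.append(captured / total)
    return mean(proportions) if proportions else None


# Hypothetical patient: 1 of 2 inpatient and 6 of 8 outpatient encounters
# appear in the EHR, so the score is (0.5 + 0.75) / 2.
score = mpec({"inpatient": 1, "outpatient": 6},
             {"inpatient": 2, "outpatient": 8})
print(score)  # 0.625
```

Under these assumptions, a score near 1 indicates that most of a patient's care is visible in the EHR, while a score near 0 indicates that most encounters were recorded elsewhere. Cohorts with "high" and "low" continuity could then be formed by thresholding this score, although the thresholds used in the study are not given here.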

