Electronic health record (EHR) data vary considerably in completeness across sites and patients. EHR data-discontinuity (a lack of EHR data-continuity), defined as "having medical information recorded outside the reach of an EHR system," can introduce substantial information bias. The objective of this study was to comprehensively evaluate (1) how EHR data-discontinuity introduces data bias, (2) how case-finding algorithms affect downstream prediction models, and (3) how algorithmic fairness is associated with racial-ethnic disparities. We leveraged EHRs linked with Medicaid and Medicare claims data in the OneFlorida+ network and used a validated measure, Mean Proportions of Encounters Captured (MPEC), to estimate patients' EHR data-continuity. We developed a machine learning model for predicting type 2 diabetes (T2D) diagnosis as the use case for this work. We found that selecting cohorts by different levels of EHR data-continuity affects model utility in disease prediction tasks: prediction models trained on high-continuity data fit worse on low-continuity data. We also found that racial and ethnic disparities in model performance and model fairness varied across models developed using different degrees of data continuity. Our results suggest that careful evaluation of data continuity is critical to improving both the validity of real-world evidence generated from EHR data and health equity.
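As an illustration of the continuity measure named above, the following is a minimal sketch of how an MPEC-style score might be computed for one patient. The abstract does not give the formula, so this sketch assumes MPEC is the mean of two proportions: outpatient and inpatient encounters recorded in the EHR divided by the corresponding encounters observed in linked claims. The class `EncounterCounts`, the function names, the cap at 1.0, and the handling of patients with no claims encounters of a given type are all hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EncounterCounts:
    """Per-patient encounter counts (hypothetical structure for illustration)."""
    ehr_outpatient: int      # outpatient encounters recorded in the EHR
    claims_outpatient: int   # outpatient encounters observed in linked claims
    ehr_inpatient: int       # inpatient encounters recorded in the EHR
    claims_inpatient: int    # inpatient encounters observed in linked claims


def proportion_captured(ehr: int, claims: int) -> Optional[float]:
    """Share of claims-observed encounters that also appear in the EHR.

    Returns None when claims show no encounters of this type, since the
    proportion is undefined. Capping at 1.0 is an assumption of this sketch.
    """
    if claims == 0:
        return None
    return min(ehr / claims, 1.0)


def mpec(counts: EncounterCounts) -> Optional[float]:
    """Mean of the defined outpatient and inpatient capture proportions."""
    parts = [
        p
        for p in (
            proportion_captured(counts.ehr_outpatient, counts.claims_outpatient),
            proportion_captured(counts.ehr_inpatient, counts.claims_inpatient),
        )
        if p is not None
    ]
    if not parts:
        return None  # no claims encounters at all; continuity cannot be estimated
    return sum(parts) / len(parts)


# Example: 3 of 4 outpatient and 1 of 2 inpatient encounters captured.
patient = EncounterCounts(ehr_outpatient=3, claims_outpatient=4,
                          ehr_inpatient=1, claims_inpatient=2)
score = mpec(patient)
```

Under these assumptions, patients could then be grouped into high- and low-continuity cohorts by thresholding the score, which is the kind of cohort selection whose downstream effects the study evaluates.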