Administrative patient-care data such as electronic health records (EHR) and medical/pharmaceutical claims are increasingly used for population-based scientific research. Because vast sample sizes yield very small standard errors, researchers need to pay closer attention to potential biases in estimates of the association parameters of interest, specifically biases that do not diminish as the sample size increases. Among the many sources of bias, this paper focuses on selection bias. We present an analytic framework using directed acyclic graphs to guide applied researchers in dissecting how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches to reduce selection bias, with accompanying variance formulae. Through a simulation study, we demonstrate when these approaches can mitigate selection bias in practice when analyzing real-world data. We compare the methods in a data example whose goal is to estimate the well-known association between cancer and biological sex, using EHR from a longitudinal biorepository at the University of Michigan Healthcare system. We provide annotated R code to implement these weighted methods with associated inference.
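As a generic illustration of how weighting can counter selection bias, consider inverse probability of selection weighting for a logistic association model. This is a sketch under stated assumptions and not necessarily one of the four estimators studied here. The notation is hypothetical and not taken from the paper: $D_i$ is the binary outcome, $Z_i$ the exposure, $S_i$ the indicator that individual $i$ of a target population of size $N$ is selected into the analytic sample, and $\pi_i = P(S_i = 1 \mid \cdot)$ a selection probability assumed known or consistently estimated.

```latex
% Assumed association model in the target population (hypothetical notation):
\[
  \operatorname{logit} P(D_i = 1 \mid Z_i) = \theta_0 + \theta_1 Z_i .
\]
% Weighted score equation: each selected individual is weighted by 1/\pi_i,
% so the selected sample stands in for the full target population.
\[
  \sum_{i=1}^{N} \frac{S_i}{\pi_i}
  \begin{pmatrix} 1 \\ Z_i \end{pmatrix}
  \left\{ D_i - \frac{\exp(\theta_0 + \theta_1 Z_i)}{1 + \exp(\theta_0 + \theta_1 Z_i)} \right\}
  = \mathbf{0} .
\]
```

If the $\pi_i$ are correct and bounded away from zero, each term has the same expectation as the corresponding unweighted population score, so solving this equation targets the population parameter $\theta_1$ rather than its selection-distorted counterpart. When the $\pi_i$ are estimated, the variance of the resulting estimator should account for that estimation step.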