A Doubly Robust Framework for Addressing Outcome-Dependent Selection Bias in Multi-Cohort EHR Studies

Selection bias can hinder accurate estimation of association parameters in binary disease risk models fitted to non-probability samples such as electronic health records (EHRs). The issue is compounded when participants are recruited from multiple clinics or centers whose selection mechanisms differ and may depend on the disease or outcome of interest. Traditional inverse-probability-weighted (IPW) methods, which rely on parametric selection models, are often vulnerable to misspecification when selection mechanisms vary across cohorts. This paper introduces a new Joint Augmented Inverse Probability Weighted (JAIPW) method, which integrates individual-level data from multiple cohorts collected under potentially outcome-dependent selection mechanisms with data from an external probability sample. JAIPW offers double robustness by incorporating a flexible auxiliary score model to address potential misspecification of the selection models. We outline the asymptotic properties of the JAIPW estimator, and our simulations show that JAIPW achieves up to six times lower relative bias and five times lower root mean square error (RMSE) than the best-performing joint IPW methods in scenarios with misspecified selection models. Applying JAIPW to the Michigan Genomics Initiative (MGI), a multi-clinic EHR-linked biobank, combined with external national probability samples, yielded cancer-sex association estimates closely aligned with national benchmark estimates. We also analyzed the association between cancer and polygenic risk scores (PRS) in MGI to illustrate a situation where the exposure variable is not measured in the external probability sample.
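To make the underlying problem concrete, the following is a minimal sketch of how outcome-dependent selection distorts an association estimate and how plain inverse probability weighting corrects it when the selection probabilities are known. It is not the JAIPW estimator: it uses a single simulated cohort, a binary exposure, and true (not estimated) selection probabilities, and it has no external probability sample or auxiliary score model. All names (`simulate`, `log_odds_ratio`, `pi_table`) and all numeric values are hypothetical choices made for illustration.

```python
import numpy as np


def log_odds_ratio(d, x, w=None):
    """Weighted log odds ratio between binary outcome d and binary exposure x.

    For binary d and x this equals the slope of a (weighted) logistic
    regression of d on x with an intercept.
    """
    w = np.ones(len(d)) if w is None else np.asarray(w, dtype=float)
    n11 = w[(d == 1) & (x == 1)].sum()
    n10 = w[(d == 1) & (x == 0)].sum()
    n01 = w[(d == 0) & (x == 1)].sum()
    n00 = w[(d == 0) & (x == 0)].sum()
    return np.log(n11 * n00 / (n10 * n01))


def simulate(n=200_000, beta=0.5, seed=0):
    """Draw a target population, then keep only the selected individuals."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, n)                      # binary exposure
    p_d = 1.0 / (1.0 + np.exp(-(-2.0 + beta * x)))   # true disease model
    d = rng.binomial(1, p_d)                         # binary outcome
    # Assumed selection probabilities P(S=1 | D=d, X=x), indexed [d, x].
    # Selection depends on the outcome and, among cases, on the exposure.
    pi_table = np.array([[0.1, 0.1],
                         [0.2, 0.6]])
    pi = pi_table[d, x]
    s = rng.binomial(1, pi).astype(bool)             # selection indicator
    return d[s], x[s], pi[s]


TRUE_BETA = 0.5
d, x, pi = simulate(beta=TRUE_BETA)

naive = log_odds_ratio(d, x)                 # ignores selection
ipw = log_odds_ratio(d, x, w=1.0 / pi)       # weights by 1 / P(selected)

print(f"true log OR : {TRUE_BETA:.3f}")
print(f"naive       : {naive:.3f}")
print(f"IPW         : {ipw:.3f}")
```

Under these assumed selection probabilities the odds ratio in the selected sample is inflated by the factor (0.6 × 0.1) / (0.2 × 0.1) = 3, so the naive estimate overshoots the true log odds ratio by log 3, while the weighted estimate removes that offset. In practice the selection probabilities are unknown and must be modeled, which is where misspecification arises and where a doubly robust approach such as the one described above becomes relevant.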
