Real-world data (RWD), with its large sample sizes and rich clinical detail, offers a compelling alternative to randomized controlled trials (RCTs) for studying treatment effects in diverse and complex patient populations. However, its observational nature introduces confounding that prevents straightforward comparative effectiveness research. Target trial emulation leverages RWD to estimate the average treatment effect (ATE) at a population scale and diversity that RCTs cannot achieve, yet its validity depends critically on unbiased ATE estimation under high-dimensional confounding. Many causal inference pipelines address high-dimensional confounding through machine learning and artificial intelligence (ML/AI) outcome regression. However, commonly used ML/AI regression models exhibit systematic prediction bias: predicted outcomes shrink toward the marginal outcome mean. This structural bias propagates into ATE estimation and cannot be corrected by cross-fitting, ensemble methods, or any other standard ML practice. In this work, we first quantitatively characterize how systematic prediction bias in ML/AI outcome regression leads to biased ATE estimates in causal inference models. We then propose an unbiased ML/AI regression-based causal inference framework to ensure unbiased ATE estimation in observational studies. We demonstrate our approach by studying the effects of opioids on cardiovascular health in patients with chronic pain using UK Biobank data.
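The mechanism described in the abstract, where outcome predictions shrunk toward the mean bias a plug-in ATE estimate, can be illustrated with a small simulation. The sketch below is not the authors' framework or their data: the data-generating process, the helper names (`simulate`, `fit_ridge`, `g_computation_ate`), and the choice of ridge regression as a stand-in for an ML/AI regressor whose predictions shrink toward the training-sample mean are all assumptions made for illustration.

```python
import numpy as np


def simulate(n=4000, p=5, tau=2.0, seed=0):
    """Toy confounded data (hypothetical): X drives both treatment A and outcome Y."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    propensity = 1.0 / (1.0 + np.exp(-0.5 * X.sum(axis=1)))
    A = rng.binomial(1, propensity)
    Y = tau * A + X.sum(axis=1) + rng.normal(size=n)
    return X, A, Y


def fit_ridge(X, Y, lam):
    """Ridge regression with an unpenalized intercept, in closed form.

    lam = 0 is ordinary least squares; a larger lam shrinks predictions
    toward the mean outcome of the training sample.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (Y - y_mean))
    return lambda X_new: y_mean + (X_new - x_mean) @ coef


def g_computation_ate(X, A, Y, lam):
    """Plug-in ATE: fit one outcome model per arm, then predict for everyone."""
    mu1 = fit_ridge(X[A == 1], Y[A == 1], lam)
    mu0 = fit_ridge(X[A == 0], Y[A == 0], lam)
    return float(np.mean(mu1(X) - mu0(X)))


X, A, Y = simulate()  # true ATE in this toy setup is tau = 2.0
naive = float(Y[A == 1].mean() - Y[A == 0].mean())
ate_unpenalized = g_computation_ate(X, A, Y, lam=0.0)
ate_shrunk = g_computation_ate(X, A, Y, lam=2000.0)

print(f"naive difference in means:   {naive:.2f}")
print(f"outcome regression, lam=0:    {ate_unpenalized:.2f}")
print(f"outcome regression, lam=2000: {ate_shrunk:.2f}")
```

In this toy setting the unpenalized estimate should land near the true effect, while the shrunken model only partly adjusts for confounding, so its estimate should fall between the true effect and the confounded difference in means. Ridge is used here only because its shrinkage is easy to control; the abstract's claim concerns commonly used ML/AI regressors in general.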