Confounding remains one of the major challenges to causal inference with observational data. This problem is paramount in medicine, where we would like to answer causal questions from large observational datasets like electronic health records (EHRs) and administrative claims. Modern medical data typically contain tens of thousands of covariates. Such a large set carries hope that many of the confounders are directly measured, and further hope that others are indirectly measured through their correlation with measured covariates. How can we exploit these large sets of covariates for causal inference? To help answer this question, this paper examines the performance of the large-scale propensity score (LSPS) approach on causal analysis of medical data. We demonstrate that LSPS may adjust for indirectly measured confounders by including tens of thousands of covariates that may be correlated with them. We present conditions under which LSPS removes bias due to indirectly measured confounders, and we show that LSPS may avoid bias when inadvertently adjusting for variables (like colliders) that can otherwise induce bias. We demonstrate the performance of LSPS on both simulated and real medical data.