Causal machine learning methods and use of sample splitting in settings with high-dimensional confounding

Observational epidemiological studies commonly seek to estimate the causal effect of an exposure on an outcome. Adjusting for potential confounding bias in modern studies is challenging because of high-dimensional confounding, which arises when there are many confounders relative to the sample size or when continuous confounders have complex relationships with the exposure and outcome. As a promising avenue to overcome this challenge, doubly robust methods (Augmented Inverse Probability Weighting (AIPW) and Targeted Maximum Likelihood Estimation (TMLE)) enable the use of data-adaptive approaches to fit the two models they involve, namely the exposure model and the outcome model. However, standard errors may be biased when the data-adaptive approaches used are very complex, and coupling doubly robust methods with cross-fitting has been proposed to tackle this. Despite these advances, limited evaluation, comparison, and guidance are available on implementing AIPW and TMLE with data-adaptive approaches and cross-fitting in realistic settings where high-dimensional confounding is present. We conducted an extensive simulation study to compare the relative performance of AIPW and TMLE using data-adaptive approaches in estimating the average causal effect (ACE). We also evaluated the benefits of cross-fitting with a varying number of folds, and the impact of using a reduced versus a full (larger, more diverse) library in the Super Learner (SL) ensemble learning approach used for the data-adaptive models. We considered a range of data-generating scenarios and sample sizes. We found that AIPW and TMLE performed similarly in most cases for estimating the ACE, but TMLE was more stable. Cross-fitting improved the performance of both methods, whereas the number of folds was a less important consideration. Using a full SL library was important to reduce bias and variance in the complex scenarios typical of modern health research studies.
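To make the idea of cross-fitting concrete, the following is a minimal sketch of a cross-fitted AIPW estimate of the ACE on toy data. It is not the simulation design or software used in the study. The function names (`simulate`, `fit_nuisances`, `cross_fit_aipw`), the single binary confounder, the true effect of 1.0, and the truncation of propensity scores to [0.01, 0.99] are all assumptions made for illustration, and simple stratified means stand in for the Super Learner. Only the Python standard library is used.

```python
import random
from statistics import mean, stdev


def simulate(n, seed=0):
    """Toy data (hypothetical): binary confounder W, binary exposure A,
    continuous outcome Y. The true ACE is 1.0 by construction."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < (0.7 if w else 0.3) else 0
        y = 1.0 * a + 2.0 * w + rng.gauss(0.0, 1.0)
        rows.append((w, a, y))
    return rows


def fit_nuisances(train):
    """Fit the two nuisance models on the training folds only.
    Stratified means are a stand-in for a data-adaptive learner."""
    overall = mean(y for _, _, y in train)
    g = {}  # exposure model: P(A = 1 | W = w)
    q = {}  # outcome model: E[Y | A = a, W = w]
    for w in (0, 1):
        exposures = [a for wi, a, _ in train if wi == w]
        p = mean(exposures) if exposures else 0.5
        g[w] = min(max(p, 0.01), 0.99)  # assumed truncation bounds
        for a in (0, 1):
            ys = [y for wi, ai, y in train if wi == w and ai == a]
            q[(a, w)] = mean(ys) if ys else overall
    return g, q


def cross_fit_aipw(rows, n_folds=5, seed=0):
    """Cross-fitted AIPW: nuisance models are fitted on K-1 folds and
    evaluated on the held-out fold, so no observation is scored by a
    model that was trained on it."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    phi = [0.0] * len(rows)  # per-observation AIPW contributions
    for held_out in folds:
        held = set(held_out)
        train = [rows[i] for i in idx if i not in held]
        g, q = fit_nuisances(train)
        for i in held_out:
            w, a, y = rows[i]
            q1, q0 = q[(1, w)], q[(0, w)]
            phi[i] = (q1 - q0
                      + a * (y - q1) / g[w]
                      - (1 - a) * (y - q0) / (1.0 - g[w]))
    estimate = mean(phi)
    std_error = stdev(phi) / len(phi) ** 0.5
    return estimate, std_error


if __name__ == "__main__":
    data = simulate(4000, seed=1)
    for k in (2, 5, 10):
        est, se = cross_fit_aipw(data, n_folds=k, seed=1)
        print(f"folds={k:2d}  ACE estimate={est:.3f}  SE={se:.3f}")
```

In this toy setting the nuisance models are so simple that cross-fitting changes little. The sketch only shows where the sample splitting enters the estimator; the benefit discussed in the abstract concerns complex data-adaptive learners, for which `fit_nuisances` would be replaced by an ensemble fit.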
