In causal machine learning, the fitting and evaluation of nuisance models are typically performed on separate partitions, or folds, of the observed data. This technique, called cross-fitting, eliminates bias introduced by the use of black-box predictive algorithms. When study units may be correlated, as in spatial, clustered, or time-series data, investigators often design bespoke forms of cross-fitting to minimize correlation between folds. We prove that, perhaps contrary to popular belief, this is typically unnecessary: performing cross-fitting as if study units were independent usually still eliminates key bias terms even when units are correlated. In simulation experiments with various correlation structures, we show that causal machine learning estimators based on correlation-ignoring cross-fitting typically match or improve on the bias and precision of estimators based on techniques that strive to eliminate correlation between folds.
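To make the procedure concrete, the following is a minimal sketch of cross-fitting for a partially linear model, assigning folds uniformly at random without regard to any correlation structure, as the abstract describes. The function name `crossfit_plm` and the use of OLS as a stand-in for an arbitrary black-box nuisance learner are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def crossfit_plm(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in Y = theta*D + g(X) + eps.

    Folds are assigned at random as if units were independent (the scheme
    the abstract argues usually suffices even under correlation). OLS is
    used here as a placeholder for any black-box nuisance learner.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=n)   # correlation-ignoring fold assignment
    Xc = np.column_stack([np.ones(n), X])     # add an intercept column
    y_res = np.empty(n)
    d_res = np.empty(n)
    for k in range(n_folds):
        train, test = fold != k, fold == k
        # fit nuisance models only on out-of-fold data
        b_y, *_ = np.linalg.lstsq(Xc[train], y[train], rcond=None)
        b_d, *_ = np.linalg.lstsq(Xc[train], d[train], rcond=None)
        # residualize in-fold observations with out-of-fold fits
        y_res[test] = y[test] - Xc[test] @ b_y
        d_res[test] = d[test] - Xc[test] @ b_d
    return float(d_res @ y_res / (d_res @ d_res))

# toy simulation with true theta = 2.0
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
d = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=2000)
y = 2.0 * d + X @ np.array([0.5, 0.5, 0.5]) + rng.normal(size=2000)
theta_hat = crossfit_plm(y, d, X)
```

The key step is that each unit's nuisance predictions come from models fit on the other folds, which is what removes the own-observation overfitting bias; the fold assignment itself carries no information about spatial or temporal proximity.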