Causal inference from observational data requires untestable identification assumptions. If these assumptions hold, machine learning (ML) methods can be used to study complex forms of causal effect heterogeneity. Several ML methods have recently been developed to estimate the conditional average treatment effect (CATE). If the available features cannot explain all heterogeneity, the individual treatment effects (ITEs) can deviate seriously from the CATE. In this work, we demonstrate how the distributions of the ITE and the CATE can differ when a causal random forest (CRF) is applied. We extend the CRF to estimate the difference in conditional variance between treated and control individuals. If the ITE distribution equals the CATE distribution, this estimated difference in variance should be small. If they differ, an additional causal assumption is necessary to quantify the heterogeneity not captured by the CATE distribution. The conditional variance of the ITE can be identified when, given the measured features, the individual effect is independent of the outcome under no treatment. In that case, when the ITE and CATE distributions differ, the extended CRF can appropriately estimate the variance of the ITE distribution, whereas the standard CRF fails to do so.
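The identification argument can be sketched in a few lines. The notation here is an assumption for illustration and is not taken from the abstract: $Y^1$ and $Y^0$ denote the potential outcomes with and without treatment, $\tau = Y^1 - Y^0$ the ITE, $A$ the treatment indicator, $X$ the measured features, and $Y$ the observed outcome. The last equality in the first display additionally assumes the usual consistency and conditional exchangeability conditions, which is one way to read the "untestable identification assumptions" mentioned above.

```latex
\begin{align*}
Y^1 &= Y^0 + \tau, \\
\operatorname{Var}(Y^1 \mid X)
  &= \operatorname{Var}(Y^0 \mid X) + \operatorname{Var}(\tau \mid X)
     + 2\operatorname{Cov}(Y^0, \tau \mid X), \\
\tau \perp Y^0 \mid X \;\Longrightarrow\;
\operatorname{Var}(\tau \mid X)
  &= \operatorname{Var}(Y^1 \mid X) - \operatorname{Var}(Y^0 \mid X) \\
  &= \operatorname{Var}(Y \mid A = 1, X) - \operatorname{Var}(Y \mid A = 0, X).
\end{align*}

% Law of total variance: the CATE distribution captures only the first term.
\begin{equation*}
\operatorname{Var}(\tau)
  = \operatorname{Var}\bigl(\mathbb{E}[\tau \mid X]\bigr)
  + \mathbb{E}\bigl[\operatorname{Var}(\tau \mid X)\bigr].
\end{equation*}
```

Under these assumptions, the difference in conditional variance between treated and control individuals equals the conditional variance of the ITE. By the second display, the variance of the CATE distribution, $\operatorname{Var}(\mathbb{E}[\tau \mid X])$, understates the variance of the ITE distribution whenever $\operatorname{Var}(\tau \mid X)$ is not zero almost everywhere.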