We study the problem of estimating causal effects under hidden confounding in the following unpaired data setting: we observe covariates $X$ and an outcome $Y$ under different experimental conditions (environments), but never jointly; each observation contains either $X$ or $Y$. Under appropriate regularity conditions, the problem can be cast as an instrumental variable (IV) regression with the environment acting as a (possibly high-dimensional) instrument. When there are many environments but only a few observations per environment, standard two-sample IV estimators are inconsistent. We propose a GMM-type estimator based on cross-fold sample splitting of the instrument-covariate sample and prove that it is consistent as the number of environments grows while the sample size per environment remains constant. We further extend the method to sparse causal effects via $\ell_1$-regularized estimation and post-selection refitting.
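To illustrate why sample splitting helps when there are many environments and few observations per environment, the following is a minimal simulation sketch, not the estimator proposed in the paper. It assumes a linear model without intercept, two $X$-observations and one $Y$-observation per environment, and Gaussian noise; the function names (`simulate`, `naive_two_sample_iv`, `cross_fold_iv`) and all simulation parameters are hypothetical. The naive estimator regresses per-environment means of $Y$ on per-environment means of $X$, so the noise in the $X$-means enters the Gram matrix and attenuates the estimate. The cross-fold version forms the Gram matrix from two independent folds of the $X$-sample, whose noise terms are uncorrelated.

```python
import numpy as np


def simulate(m, beta, rng):
    """Draw unpaired data for m environments (all modelling choices are assumptions).

    Returns x_obs of shape (m, 2, d), two X-observations per environment, and
    ybar of shape (m,), the mean of a single Y-observation per environment
    taken from *different* units than x_obs.
    """
    d = beta.size
    mu = rng.normal(size=(m, d))  # environment-specific shift of X (instrument effect)

    def draw_x(k):
        h = rng.normal(size=(m, k, d))  # hidden confounder
        x = mu[:, None, :] + h + rng.normal(size=(m, k, d))
        return x, h

    x_obs, _ = draw_x(2)          # X-sample: covariates only
    x_lat, h_lat = draw_x(1)      # Y-sample: its covariates are never observed
    y = x_lat @ beta + h_lat.sum(axis=2) + rng.normal(size=(m, 1))
    return x_obs, y.mean(axis=1)


def naive_two_sample_iv(x_obs, ybar):
    """Regress environment means of Y on environment means of X."""
    xbar = x_obs.mean(axis=1)
    return np.linalg.solve(xbar.T @ xbar, xbar.T @ ybar)


def cross_fold_iv(x_obs, ybar):
    """Split each environment's X-sample into two folds and use cross moments."""
    x1 = x_obs[:, 0::2, :].mean(axis=1)  # fold 1 means per environment
    x2 = x_obs[:, 1::2, :].mean(axis=1)  # fold 2 means per environment
    # Symmetrized cross-fold Gram matrix: noise in x1 and x2 is independent,
    # so its expectation does not contain the within-environment noise variance.
    gram = 0.5 * (x1.T @ x2 + x2.T @ x1)
    rhs = 0.5 * (x1 + x2).T @ ybar
    return np.linalg.solve(gram, rhs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta = np.array([1.0, -2.0])
    x_obs, ybar = simulate(20000, beta, rng)
    print("true beta      :", beta)
    print("naive estimate :", naive_two_sample_iv(x_obs, ybar))
    print("cross-fold est.:", cross_fold_iv(x_obs, ybar))
```

Under these simulation assumptions, the naive estimate is expected to shrink toward zero regardless of how many environments are added, whereas the cross-fold estimate should concentrate around the true coefficient as the number of environments grows.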