When estimating causal effects from observational studies, researchers often need to adjust for many covariates, many of them discrete, to remove confounding of the relationship between exposure and outcome. The behavior of commonly used estimators in the presence of many discrete covariates is not well understood, because their properties are usually analyzed under structural assumptions such as sparsity and smoothness, which do not apply in discrete settings. In this work, we study the estimation of causal effects in a model where the covariates required for confounding adjustment are discrete but high-dimensional, meaning that the number of categories $d$ is comparable to or even larger than the sample size $n$. Specifically, we show that the mean squared error of commonly used regression, weighting, and doubly robust estimators is bounded by $\frac{d^2}{n^2}+\frac{1}{n}$. We then prove that the minimax lower bound for the average treatment effect is of order $\frac{d^2}{n^2 \log^2 n}+\frac{1}{n}$, which characterizes the fundamental difficulty of causal effect estimation in the high-dimensional discrete setting and shows that the estimators above are rate-optimal up to logarithmic factors. We further consider additional structure that can be exploited, namely effect homogeneity and prior knowledge of the covariate distribution, and propose new estimators that enjoy faster convergence rates of order $\frac{d}{n^2} + \frac{1}{n}$ and therefore achieve consistency in a broader regime. The results are illustrated empirically via simulation studies.
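To make the comparison of rates concrete, the following minimal sketch evaluates the three orders stated in the abstract for given $d$ and $n$. It is only an arithmetic illustration: the function names (`plugin_rate`, `minimax_rate`, `structured_rate`) are hypothetical, all constants are assumed to be 1, and nothing here implements the estimators themselves. It may help to see that $\frac{d^2}{n^2}+\frac{1}{n}$ vanishes only when $d/n \to 0$, whereas $\frac{d}{n^2}+\frac{1}{n}$ vanishes whenever $d/n^2 \to 0$.

```python
import math


def plugin_rate(d: int, n: int) -> float:
    """Order of the MSE upper bound d^2/n^2 + 1/n (constants assumed to be 1)."""
    return (d / n) ** 2 + 1.0 / n


def minimax_rate(d: int, n: int) -> float:
    """Order of the minimax lower bound d^2/(n^2 log^2 n) + 1/n (constants assumed to be 1)."""
    return (d / (n * math.log(n))) ** 2 + 1.0 / n


def structured_rate(d: int, n: int) -> float:
    """Order d/n^2 + 1/n stated for the estimators that exploit additional structure."""
    return d / n**2 + 1.0 / n


if __name__ == "__main__":
    n = 10_000
    # Three illustrative regimes: d much smaller than n, d equal to n, d larger than n.
    for d in (100, n, 10 * n):
        print(
            f"d={d:>7}  plugin={plugin_rate(d, n):.3e}  "
            f"minimax={minimax_rate(d, n):.3e}  structured={structured_rate(d, n):.3e}"
        )
```

With $d = n$, the first order stays bounded away from zero while the third is $2/n$, which is the sense in which the structured estimators remain consistent in a broader regime.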