When estimating causal effects from observational studies, researchers often need to adjust for many covariates to remove confounding of the relationship between exposure and outcome, and many of these covariates are discrete. The behavior of commonly used estimators in the presence of many discrete covariates is not well understood, because their properties are usually analyzed under structural assumptions such as sparsity and smoothness, which do not apply in discrete settings. In this work, we study the estimation of causal effects in a model where the covariates required for confounding adjustment are discrete but high-dimensional, meaning that the number of categories $d$ is comparable to or even larger than the sample size $n$. Specifically, we show that the mean squared error of commonly used regression, weighting, and doubly robust estimators is bounded above by $\frac{d^2}{n^2}+\frac{1}{n}$. We then prove that the minimax lower bound for estimating the average treatment effect is of order $\frac{d^2}{n^2 \log^2 n}+\frac{1}{n}$, which characterizes the fundamental difficulty of causal effect estimation in the high-dimensional discrete setting and shows that the estimators above are rate-optimal up to logarithmic factors. We further consider additional structure that can be exploited, namely effect homogeneity and prior knowledge of the covariate distribution, and propose new estimators that enjoy faster convergence rates of order $\frac{d}{n^2} + \frac{1}{n}$ and are therefore consistent in a broader regime. Simulation studies illustrate the results empirically.
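As a rough illustration of the kind of regression estimator the abstract refers to, the following minimal sketch computes a plug-in (stratified) estimate of the average treatment effect from a single discrete covariate. The function name `plugin_ate`, the binary treatment coding, and the rule of skipping categories that lack treated or control units are assumptions made for this sketch; the abstract does not specify how the paper's estimators are defined or how they handle such categories.

```python
from collections import defaultdict


def plugin_ate(x, a, y):
    """Plug-in ATE estimate for a discrete covariate (illustrative sketch).

    x: category label of each unit (any hashable value)
    a: treatment indicator of each unit (truthy = treated)
    y: numeric outcome of each unit
    """
    # Per category: [count, outcome sum] for control (index 0) and treated (index 1).
    stats = defaultdict(lambda: [[0, 0.0], [0, 0.0]])
    for xi, ai, yi in zip(x, a, y):
        cell = stats[xi][1 if ai else 0]
        cell[0] += 1
        cell[1] += yi

    weighted_sum = 0.0
    total_weight = 0
    for (n0, s0), (n1, s1) in stats.values():
        # Assumption of this sketch: categories without both treated and
        # control units are skipped, and the weights are renormalized.
        if n0 == 0 or n1 == 0:
            continue
        weight = n0 + n1  # empirical frequency of the category (unnormalized)
        weighted_sum += weight * (s1 / n1 - s0 / n0)
        total_weight += weight

    if total_weight == 0:
        raise ValueError("no category contains both treated and control units")
    return weighted_sum / total_weight


# Category 2 has only a treated unit, so it is skipped.
print(plugin_ate([0, 0, 1, 1, 2], [1, 0, 1, 0, 1], [3.0, 1.0, 5.0, 2.0, 9.0]))  # 2.5
```

When $d$ is comparable to $n$, many categories contain few or no units in one treatment arm, so the choice made in the skip rule above can matter; the rule here is only one possible convention.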