A New Covariate Selection Strategy for High Dimensional Data in Causal Effect Estimation with Multivariate Treatments

Covariate selection is crucial when estimating average treatment effects from observational data with high or even ultra-high dimensional pretreatment variables. Existing methods for this problem typically assume sparse linear models for both the outcome and a univariate treatment, and cannot handle ultra-high dimensional covariates. In this paper, we propose a new covariate selection strategy, double screening prior adaptive lasso (DSPAL), to select confounders and predictors of the outcome for multivariate treatments. DSPAL combines the adaptive lasso with marginal conditional (in)dependence prior information to select the target covariates, with the aim of eliminating confounding bias and improving statistical efficiency. The distinctive features of our proposal are that it applies to high-dimensional or even ultra-high dimensional covariates with multivariate treatments, and that it accommodates both parametric and nonparametric outcome models, which makes it more robust than competing methods. Our theoretical analyses show that the proposed procedure enjoys the sure screening property, the ranking consistency property, and variable selection consistency. In a simulation study, we demonstrate that the proposed approach selects all confounders and predictors consistently and estimates the multivariate treatment effects with smaller bias and mean squared error than several alternatives under various scenarios. In a real data analysis, we apply the method to estimate the causal effect of a three-dimensional continuous environmental treatment on cholesterol level and obtain informative results.
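To make the general idea of "screen first, then fit an adaptive lasso" concrete, the following is a minimal Python sketch. It is not the DSPAL procedure: the abstract does not specify the screening statistic, the prior weights, or the tuning rules, so everything here is an assumption made for illustration. The function name `screen_then_adaptive_lasso`, the use of absolute marginal covariance as the screening score, the ridge initial estimate, and the defaults `n_keep`, `alpha`, and `gamma` are all hypothetical, and only a single linear outcome model is handled.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge


def screen_then_adaptive_lasso(X, y, n_keep=20, alpha=0.1, gamma=1.0):
    """Generic marginal screening followed by an adaptive lasso fit.

    Illustration only; NOT the DSPAL procedure from the paper.
    Returns the sorted column indices of X that are selected.
    """
    # Standardize covariates and center the outcome.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()

    # Step 1 (assumed screening rule): keep the n_keep covariates with
    # the largest absolute marginal covariance with the outcome.
    score = np.abs(Xs.T @ yc) / len(yc)
    keep = np.sort(np.argsort(score)[::-1][:n_keep])
    Xk = Xs[:, keep]

    # Step 2: initial estimate (ridge, an assumed choice) that defines the
    # adaptive weights w_j = 1 / |beta_init_j|^gamma.
    beta_init = Ridge(alpha=1.0).fit(Xk, yc).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)

    # Step 3: a weighted lasso equals a plain lasso on rescaled columns
    # X_j / w_j, with coefficients mapped back by dividing by w_j.
    theta = Lasso(alpha=alpha, max_iter=10000).fit(Xk / w, yc).coef_
    beta = theta / w

    return keep[np.abs(beta) > 1e-8]
```

The column rescaling in step 3 is a standard way to obtain an adaptive lasso from an ordinary lasso solver: covariates with small initial estimates receive large weights and are penalized more heavily.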
