Motivated by the recent literature on the double-descent phenomenon in machine learning, we consider highly over-parametrized models in causal inference, including synthetic control with many control units. Such models may have so many free parameters that they fit the training data perfectly. As a motivating example, we first investigate high-dimensional linear regression for imputing wage data, where we find that models with many more covariates than observations can outperform simple ones. As our main contribution, we document the performance of high-dimensional synthetic control estimators with many control units. We find that adding control units can improve imputation performance even beyond the point where the pre-treatment fit is perfect. We then provide a unified theoretical perspective on the performance of these high-dimensional models. Specifically, we show that more complex models can be interpreted as model-averaging estimators over simpler ones, which we link to an improvement in average performance. This perspective yields concrete insights into the use of synthetic control when control units are many relative to the number of pre-treatment periods.
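
To make the notion of a perfect pre-treatment fit concrete, the sketch below simulates a setting with more control units than pre-treatment periods and computes minimum-norm least-squares weights. It is only an illustration under stated assumptions: the data are random draws, the names `min_norm_weights`, `Y0`, `y1`, `T0`, and `J` are hypothetical, and the weights are unconstrained, unlike the classical synthetic control estimator, which restricts weights to be non-negative and sum to one. It is not claimed to be the estimator studied in the paper.

```python
import numpy as np


def min_norm_weights(Y0, y1):
    """Return the minimum-l2-norm least-squares weights w for Y0 @ w ~ y1.

    Y0: (T0, J) array of control-unit outcomes over T0 pre-treatment periods.
    y1: (T0,) array of treated-unit outcomes over the same periods.
    Assumption: weights are unconstrained (no simplex restriction).
    """
    return np.linalg.pinv(Y0) @ y1


# Simulated data (an assumption for illustration, not data from the paper).
rng = np.random.default_rng(0)
T0, J = 5, 20                      # few pre-treatment periods, many controls
Y0 = rng.normal(size=(T0, J))
y1 = rng.normal(size=T0)

# With J > T0 and Y0 of full row rank, the pre-treatment fit is exact.
w_many = min_norm_weights(Y0, y1)
err_many = np.max(np.abs(Y0 @ w_many - y1))

# With only 3 control units (fewer than T0), the fit is generally imperfect.
w_few = min_norm_weights(Y0[:, :3], y1)
err_few = np.max(np.abs(Y0[:, :3] @ w_few - y1))

print(f"max pre-treatment error, {J} controls: {err_many:.2e}")
print(f"max pre-treatment error, 3 controls: {err_few:.2e}")
```

Once the number of control units exceeds the number of pre-treatment periods, many weight vectors reproduce the treated unit's pre-treatment path exactly; the pseudoinverse selects the one with the smallest Euclidean norm. Whether adding further units then improves imputation is the empirical question the abstract describes, and this sketch does not address it.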