The synthetic control (SC) method is a popular approach for estimating treatment effects from observational panel data. It rests on a crucial assumption: that the treated unit can be written as a linear combination of the untreated units. This linearity assumption, however, is often unlikely to hold in practice and, when it is violated, the resulting SC estimates are incorrect. In this paper we examine two questions: (1) How large can the misspecification error be? (2) How can we limit it? First, we provide theoretical bounds that quantify the misspecification error. The bounds are comforting: small misspecifications induce small errors. With these bounds in hand, we then develop new SC estimators that are specifically designed to minimize misspecification error. The estimators use additional data about each unit to produce the SC weights. (For example, if the units are countries, the additional data might be demographic information about each country.) We study our estimators on synthetic data and find that they produce more accurate causal estimates than standard synthetic controls. We then re-analyze the California tobacco-program data of the original SC paper, now including additional data from the US census about per-state demographics. Our estimators show that the observations in the pre-treatment period lie within the bounds of misspecification error, and that the post-treatment observations lie outside those bounds. This is evidence that our SC methods have uncovered a true effect.
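To make the linearity assumption concrete, here is a minimal sketch of a standard synthetic control fit. It is not the estimators developed in this paper. It assumes the common convention that the weights are nonnegative and sum to one, which the abstract does not state. It fits them with a Frank-Wolfe loop, chosen only to keep the example dependency-free. The function names and all numbers are hypothetical toy values.

```python
# Sketch only: classic SC weights on toy data, not the paper's new estimators.
# Assumption: weights lie on the simplex (nonnegative, sum to one).


def synthetic_outcome(weights, donors):
    """Weighted combination of donor series: sum_j w_j * donors[j][t]."""
    n_periods = len(donors[0])
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in range(n_periods)]


def sc_weights(donors_pre, treated_pre, n_iter=5000):
    """Minimize 0.5 * ||sum_j w_j * donor_j - treated||^2 over the simplex.

    Uses Frank-Wolfe with exact line search (an illustrative solver choice).
    donors_pre: list of J pre-treatment series, each of length T0.
    treated_pre: pre-treatment series of the treated unit, length T0.
    """
    n_donors = len(donors_pre)
    weights = [1.0 / n_donors] * n_donors  # start from uniform weights
    for _ in range(n_iter):
        synth = synthetic_outcome(weights, donors_pre)
        resid = [s - y for s, y in zip(synth, treated_pre)]
        # Gradient with respect to w_j is the inner product <donor_j, resid>.
        grad = [sum(x * r for x, r in zip(d, resid)) for d in donors_pre]
        k = min(range(n_donors), key=grad.__getitem__)  # best simplex vertex
        # Moving toward vertex k changes the fit along (donor_k - synth).
        direction = [x - s for x, s in zip(donors_pre[k], synth)]
        denom = sum(x * x for x in direction)
        if denom == 0.0:
            break
        step = -sum(r * x for r, x in zip(resid, direction)) / denom
        step = max(0.0, min(1.0, step))  # stay inside the simplex
        if step == 0.0:
            break  # no descent direction left: current weights are optimal
        weights = [(1.0 - step) * w for w in weights]
        weights[k] += step
    return weights


def estimated_effects(weights, donors_post, treated_post):
    """Post-treatment gap between the treated unit and its synthetic control."""
    synth_post = synthetic_outcome(weights, donors_post)
    return [y - s for y, s in zip(treated_post, synth_post)]


# Toy data (made up): three donor units, four pre-treatment periods.
donors_pre = [[10, 12, 14, 16], [20, 18, 16, 14], [5, 15, 5, 15]]
donors_post = [[18, 20], [12, 10], [5, 15]]
true_weights = [0.5, 0.3, 0.2]  # linearity holds exactly in this toy example

treated_pre = synthetic_outcome(true_weights, donors_pre)
# Treated unit after treatment: the same combination shifted by an effect of -5.
treated_post = [s - 5.0 for s in synthetic_outcome(true_weights, donors_post)]

weights = sc_weights(donors_pre, treated_pre)
print("weights:", [round(w, 3) for w in weights])
print("effects:", [round(e, 3) for e in estimated_effects(weights, donors_post, treated_post)])
```

In this toy setting the treated unit is an exact combination of the donors, so the fit recovers the weights and the post-treatment gap reflects the injected effect. When the linear relationship holds only approximately, the gap also absorbs misspecification error, which is the quantity the bounds in this paper address.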