Demystifying and avoiding the OLS "weighting problem": Unmodeled heterogeneity and straightforward solutions

Researchers have long run regressions of an outcome variable (Y) on a treatment (D) and covariates (X) to estimate treatment effects. Even absent unobserved confounding, the regression coefficient on D in this setup reports a conditional-variance-weighted average of strata-wise average effects, which is not generally equal to the average treatment effect (ATE). Numerous proposals have been offered to cope with this "weighting problem", including interpretational tools to help characterize the weights and diagnostic aids to help researchers assess the potential severity of the problem. We make two contributions that together suggest an alternative direction for researchers and this literature. Our first contribution is conceptual, demystifying these weights. Simply put, under heterogeneous treatment effects (and varying probability of treatment), the linear regression of Y on D and X will be misspecified. The "weights" of regression offer one characterization of the regression coefficient that helps to clarify how it will depart from the ATE. We also derive a more general expression for the weights than what is usually referenced. Our second contribution is practical: because these weights merely characterize misspecification bias, we suggest avoiding them altogether through approaches that tolerate heterogeneous effects. A wide range of longstanding alternatives (regression-imputation/g-computation, interacted regression, and balancing weights) relax specification assumptions to allow heterogeneous effects. We make explicit the assumption of "separate linearity", under which each potential outcome is separately linear in X. This relaxation of conventional linearity offers a common justification for all of these methods and avoids the weighting problem, at an efficiency cost that will be small when there are few covariates relative to sample size.
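To make the weighting problem concrete, the following is a minimal simulation sketch, not the paper's own code. Every number in it is an illustrative assumption: a single binary covariate X, treatment probabilities of 0.5 and 0.1 across the two strata, and stratum effects of 1 and 3, so the ATE is 2. With a binary X the regression is saturated in X, and the familiar variance-weighting result implies that the OLS coefficient on D targets (0.5·0.25·1 + 0.5·0.09·3) / (0.5·0.25 + 0.5·0.09) = 0.26/0.17 ≈ 1.53 rather than 2. Regression imputation, which fits the treated and control outcomes separately (as separate linearity permits), should instead recover a value close to the ATE. The function names are hypothetical.

```python
import numpy as np


def simulate(n=200_000, seed=0):
    """Draw (x, d, y) with heterogeneous effects and varying treatment probability.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=n)        # binary covariate, two equal-sized strata
    p = np.where(x == 1, 0.1, 0.5)          # P(D = 1 | X) differs by stratum
    d = rng.binomial(1, p)                  # treatment is unconfounded given x
    tau = np.where(x == 1, 3.0, 1.0)        # stratum-specific treatment effects
    y = 0.5 * x + tau * d + rng.normal(size=n)
    return x, d, y


def ols_coef_on_d(x, d, y):
    """Coefficient on D from the conventional regression of Y on (1, D, X)."""
    design = np.column_stack([np.ones_like(y), d, x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(beta[1])


def regression_imputation_ate(x, d, y):
    """Fit Y on (1, X) separately by treatment arm, impute both potential
    outcomes for every unit, and average the difference (g-computation)."""
    design = np.column_stack([np.ones_like(y), x])
    b1, *_ = np.linalg.lstsq(design[d == 1], y[d == 1], rcond=None)
    b0, *_ = np.linalg.lstsq(design[d == 0], y[d == 0], rcond=None)
    return float(np.mean(design @ b1 - design @ b0))


def variance_weighted_target():
    """Population value of the variance-weighted average for this design."""
    share = np.array([0.5, 0.5])            # P(X = 0), P(X = 1)
    p = np.array([0.5, 0.1])                # P(D = 1 | X)
    tau = np.array([1.0, 3.0])              # stratum effects
    w = share * p * (1 - p)                 # weight: stratum share times Var(D | X)
    return float(np.sum(w * tau) / np.sum(w))


if __name__ == "__main__":
    x, d, y = simulate()
    print("OLS coefficient on D:     ", ols_coef_on_d(x, d, y))
    print("Variance-weighted target: ", variance_weighted_target())
    print("Regression imputation ATE:", regression_imputation_ate(x, d, y))
```

In this design the stratum with the larger effect has treatment probability far from one half, so it receives little weight and the OLS coefficient understates the ATE. An interacted regression of Y on D, X, and D times centered X would be another way to accommodate the heterogeneity here.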
