Demystifying and avoiding the OLS "weighting problem": Unmodeled heterogeneity and straightforward solutions

Researchers have long run regressions of an outcome variable (Y) on a treatment (D) and covariates (X) to estimate treatment effects. Even absent unobserved confounding, the regression coefficient on D in this setup reports a conditional-variance-weighted average of strata-wise average effects, which is not generally equal to the average treatment effect (ATE). Numerous proposals have been offered to cope with this "weighting problem", including interpretational tools to help characterize the weights and diagnostic aids to help researchers assess the potential severity of the problem. We make two contributions that together suggest an alternative direction for researchers and this literature. Our first contribution is conceptual, demystifying these weights. Simply put, under heterogeneous treatment effects (and varying probability of treatment), the linear regression of Y on D and X will be misspecified. The regression "weights" offer one characterization of the regression coefficient that helps to clarify how it will depart from the ATE. We also derive a more general expression for the weights than what is usually referenced. Our second contribution is practical: as these weights simply characterize misspecification bias, we suggest avoiding them altogether by using approaches that tolerate heterogeneous effects. A wide range of longstanding alternatives (regression-imputation/g-computation, interacted regression, and balancing weights) relax specification assumptions to allow heterogeneous effects. We make explicit the assumption of "separate linearity", under which each potential outcome is separately linear in X. This relaxation of conventional linearity offers a common justification for all of these methods and avoids the weighting problem, at an efficiency cost that will be small when there are few covariates relative to sample size.
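To make the weighting problem concrete, the following is a minimal simulation sketch, not the authors' code. It assumes a hypothetical design with three strata of X, no unobserved confounding, and made-up values for the stratum shares, treatment probabilities, and stratum effects (`strata_share`, `p_treat`, `tau` are illustrative names). In this design the population ATE is 2.5, while the conditional-variance-weighted average that OLS targets is roughly 1.74. The sketch uses the familiar special case in which X enters as a full set of stratum dummies, so the weights are proportional to the stratum size times p(x)(1 - p(x)); it does not implement the more general weight expression mentioned above. Regression imputation, which fits a separate linear model in each treatment arm, is shown as one of the alternatives.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical design (all values are illustrative assumptions).
strata_share = np.array([0.3, 0.3, 0.4])   # P(X = x)
p_treat = np.array([0.5, 0.2, 0.05])       # P(D = 1 | X = x)
tau = np.array([1.0, 2.0, 4.0])            # average effect in each stratum
true_ate = strata_share @ tau              # population ATE = 2.5

x = rng.choice(3, size=n, p=strata_share)  # stratum membership
d = rng.binomial(1, p_treat[x])            # treatment depends only on X
y = x + tau[x] * d + rng.normal(size=n)    # heterogeneous effects

X = np.eye(3)[x]                           # stratum dummies (saturated in X)

# 1) Conventional OLS of Y on D and X: the coefficient on D.
beta = np.linalg.lstsq(np.column_stack([d, X]), y, rcond=None)[0]
ols_coef = beta[0]

# 2) The same number, rebuilt as a weighted average of stratum-wise
#    differences in means with weights n_x * p_hat(x) * (1 - p_hat(x)).
n_s = np.array([(x == s).sum() for s in range(3)])
p_hat = np.array([d[x == s].mean() for s in range(3)])
diff = np.array([
    y[(x == s) & (d == 1)].mean() - y[(x == s) & (d == 0)].mean()
    for s in range(3)
])
w = n_s * p_hat * (1 - p_hat)
w = w / w.sum()
weighted_avg = w @ diff

# 3) Regression imputation (g-computation): fit Y on X separately in each
#    arm, predict both potential outcomes for everyone, average the gap.
b1 = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)[0]
b0 = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)[0]
ate_imputation = (X @ b1 - X @ b0).mean()

print(f"true ATE:              {true_ate:.3f}")
print(f"OLS coefficient on D:  {ols_coef:.3f}")
print(f"variance-weighted avg: {weighted_avg:.3f}")
print(f"regression imputation: {ate_imputation:.3f}")
```

In this sketch the OLS coefficient and the variance-weighted average agree up to floating-point error, because with saturated stratum dummies the identity is algebraic rather than approximate. The imputation estimate lands near the ATE because each potential outcome is linear in the stratum dummies, which is the "separate linearity" condition in this simple setting.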
