Understanding and avoiding the "weights of regression": Heterogeneous effects, misspecification, and longstanding solutions

Researchers in many fields endeavor to estimate treatment effects by regressing outcome data (Y) on a treatment (D) and observed confounders (X). Even absent unobserved confounding, the regression coefficient on the treatment reports a weighted average of strata-specific treatment effects (Angrist, 1998). Where heterogeneous treatment effects cannot be ruled out, the resulting coefficient is thus not generally equal to the average treatment effect (ATE), and is unlikely to be the quantity of direct scientific or policy interest. The difference between the coefficient and the ATE has led researchers to propose various interpretational, bounding, and diagnostic aids (Humphreys, 2009; Aronow and Samii, 2016; Sloczynski, 2022; Chattopadhyay and Zubizarreta, 2023). We note that the linear regression of Y on D and X can be misspecified when the treatment effect is heterogeneous in X. The "weights of regression", for which we provide a new (more general) expression, simply characterize how the OLS coefficient will depart from the ATE under the misspecification resulting from unmodeled treatment effect heterogeneity. Consequently, a natural alternative to suffering these weights is to address the misspecification that gives rise to them. For investigators committed to linear approaches, we propose relying on the slightly weaker assumption that the potential outcomes are linear in X. Numerous well-known estimators are unbiased for the ATE under this assumption, namely regression-imputation/g-computation/T-learner, regression with an interaction of the treatment and covariates (Lin, 2013), and balancing weights. Any of these approaches avoids the apparent weighting problem of the misspecified linear regression, at an efficiency cost that will be small when there are few covariates relative to sample size. We demonstrate these lessons using simulations in observational and experimental settings.
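To make the weighting problem concrete, the following minimal sketch simulates one binary confounder with heterogeneous effects and compares the additive regression coefficient against two of the alternatives named above. The data-generating process, sample size, and all variable names are illustrative assumptions, not taken from the paper's simulations. With a single binary X, the additive regression is saturated in X, so its coefficient on D equals a weighted average of the stratum effects with weights proportional to n_x p_x (1 - p_x), the variance-weighting form associated with Angrist (1998). This is the special case, not the paper's more general expression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (an assumption for illustration only).
x = rng.binomial(1, 0.5, size=n)        # one binary observed confounder
p = np.where(x == 1, 0.9, 0.5)          # treatment probability depends on x
d = rng.binomial(1, p)                  # binary treatment
tau = np.where(x == 1, 4.0, 1.0)        # stratum-specific effects; true ATE = 2.5
y = 1.0 + 2.0 * x + tau * d + rng.normal(size=n)


def ols(design, outcome):
    """Return least-squares coefficients for outcome ~ design."""
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


ones = np.ones(n)

# (1) Additive regression of Y on D and X: coefficient on D.
b_additive = ols(np.column_stack([ones, d, x]), y)[1]

# The same number rebuilt from stratum-level differences in means,
# weighted by n_x * p_x * (1 - p_x) within each stratum of x.
strata = [x == v for v in (0, 1)]
tau_hat = np.array([y[s & (d == 1)].mean() - y[s & (d == 0)].mean() for s in strata])
w = np.array([s.sum() * d[s].mean() * (1 - d[s].mean()) for s in strata])
b_weighted = np.sum(w * tau_hat) / np.sum(w)

# (2) Interacted regression with mean-centered X (Lin, 2013): coefficient on D.
xc = x - x.mean()
b_interacted = ols(np.column_stack([ones, d, xc, d * xc]), y)[1]

# (3) Regression imputation / g-computation: fit a separate linear model in
# each treatment arm, predict both potential outcomes for every unit, average.
design_x = np.column_stack([ones, x])
mu1 = design_x @ ols(design_x[d == 1], y[d == 1])
mu0 = design_x @ ols(design_x[d == 0], y[d == 0])
b_gcomp = np.mean(mu1 - mu0)

print(f"additive OLS coefficient : {b_additive:.3f}")
print(f"variance-weighted average: {b_weighted:.3f}")
print(f"interacted regression    : {b_interacted:.3f}")
print(f"g-computation            : {b_gcomp:.3f}")
```

Under this assumed design the population weights are 0.5 × 0.25 for X = 0 and 0.5 × 0.09 for X = 1, so the additive coefficient targets roughly 1.79 rather than the ATE of 2.5, because the stratum with the larger effect has less treatment variance. The interacted regression and g-computation coincide here and recover the ATE up to sampling noise.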
