This paper considers a multi-environment linear regression model in which data are collected from multiple experimental settings. The joint distribution of the response variable and covariates may vary across environments, yet the conditional expectation of $y$ given the unknown set of important variables is invariant. Such a statistical model is related to the problems of endogeneity, causal inference, and transfer learning. It is motivated by the twin goals of prediction and attribution, which correspond to estimating the true parameter and identifying the important variable set, respectively. We construct a novel environment invariant linear least squares (EILLS) objective function, a multi-environment version of linear least squares regression that leverages the above conditional-expectation invariance structure together with the heterogeneity among environments to determine the true parameter. The proposed method requires no additional structural knowledge and can identify the true parameter under a near-minimal identification condition. We establish non-asymptotic $\ell_2$ error bounds on the estimation error of the EILLS estimator in the presence of spurious variables. Moreover, we show that the $\ell_0$-penalized EILLS estimator achieves variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge. To the best of our knowledge, this paper is the first to realize statistically efficient invariance learning in the general linear model.
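To make the idea concrete, the following is a minimal numerical sketch of one plausible form of such an invariance-regularized least squares objective: a pooled squared loss plus a penalty on the per-environment residual-covariate moments of the selected coordinates, which vanish in every environment at the invariant parameter. The exact definition, environment weights, and the choice of the hyperparameter $\gamma$ here are assumptions for illustration; the precise EILLS objective is given in the paper body.

```python
import numpy as np

def eills_objective(beta, envs, gamma=10.0):
    """Sketch of an invariance-regularized least squares objective.

    beta : candidate coefficient vector.
    envs : list of (X, y) pairs, one per environment.
    gamma: invariance-penalty strength (illustrative choice).

    The first term is the (environment-size weighted) pooled squared
    loss; the second penalizes, for each selected coordinate j, the
    squared residual-covariate moment E^e[x_j (y - x^T beta)] across
    environments, which is zero for all e at the invariant parameter.
    """
    loss, penalty = 0.0, 0.0
    n_total = sum(len(y) for _, y in envs)
    for X, y in envs:
        r = y - X @ beta                 # residuals in environment e
        w = len(y) / n_total             # environment weight (assumed)
        loss += w * np.mean(r ** 2)
        moments = X.T @ r / len(y)       # empirical E^e[x_j * r] per j
        penalty += w * np.sum((beta != 0) * moments ** 2)
    return loss + gamma * penalty
```

In this sketch, a candidate relying on a spuriously correlated covariate incurs a large penalty because its residual-covariate moment differs across heterogeneous environments, while the invariant parameter keeps all such moments near zero.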