We consider regression in which one predicts a response $Y$ from a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors that have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression, which explicitly enforces stability and thus improves generalization performance on previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of argument for exploiting heterogeneity in data can be powerful in many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to characterize stable versus unstable functional dependence on the response graphically. Formally, we introduce the notion of a stable blanket, a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error among all regressions that generalize to previously unseen environments.
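The distinction between stable and unstable predictors can be made concrete with a small simulation. The sketch below is a toy illustration only, not the stabilized regression procedure itself: it assumes a hypothetical linear causal chain $X_1 \to Y \to X_2$ in which the mechanism generating $Y$ from $X_1$ is the same in every environment, while the mechanism generating $X_2$ from $Y$ has an environment-specific coefficient `b`. The function names (`simulate`, `fit`, `mse`), the sample size, and the values of `b` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(b, n=5000):
    """Draw one environment from the assumed chain X1 -> Y -> X2."""
    x1 = rng.normal(size=n)
    y = x1 + rng.normal(size=n)        # same mechanism in every environment
    x2 = b * y + rng.normal(size=n)    # mechanism changes with the environment
    return np.column_stack([x1, x2]), y

def fit(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def mse(coef, X, y):
    """Mean squared prediction error of a fitted coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((y - A @ coef) ** 2))

# Two training environments and one unseen test environment (hypothetical b values).
train = [simulate(b) for b in (0.5, 3.0)]
test_X, test_y = simulate(-2.0)

# Slope of Y on each predictor alone, fitted separately per training environment.
slopes_x1 = [fit(X[:, [0]], y)[1] for X, y in train]
slopes_x2 = [fit(X[:, [1]], y)[1] for X, y in train]

# Pool the training environments, then compare predictor sets on the test environment.
X_pool = np.vstack([X for X, _ in train])
y_pool = np.concatenate([y for _, y in train])
mse_stable = mse(fit(X_pool[:, [0]], y_pool), test_X[:, [0]], test_y)
mse_full = mse(fit(X_pool, y_pool), test_X, test_y)

print("slopes of Y on X1 per environment:", slopes_x1)
print("slopes of Y on X2 per environment:", slopes_x2)
print("test MSE using X1 only:", mse_stable)
print("test MSE using X1 and X2:", mse_full)
```

In this toy setting the fitted dependence of $Y$ on $X_1$ is nearly identical across the training environments, whereas the dependence on $X_2$ shifts with `b`. A pooled regression that also uses $X_2$ therefore transfers poorly to the test environment, while the regression on $X_1$ alone does not degrade. Within a single environment $X_2$ does carry predictive information about $Y$, which is why excluding it is a deliberate trade-off in favor of generalization.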