Modern machine learning methods and the availability of large-scale data have opened the door to accurately predicting target quantities from large sets of covariates. However, existing prediction methods can perform poorly when the training and testing distributions differ, especially in the presence of hidden confounding. While hidden confounding is well studied for causal effect estimation (e.g., via instrumental variables), this is not the case for prediction tasks. This work aims to bridge this gap by addressing prediction under different training and testing distributions in the presence of unobserved confounding. In particular, we establish a novel connection between the field of distribution generalization from machine learning and both simultaneous equation models and control functions from econometrics. Central to our contribution are simultaneous equation models for distribution generalization (SIMDGs), which describe the data-generating process under a set of distributional shifts. Within this framework, we propose a strong notion of invariance for a predictive model and compare it with existing (weaker) versions. Building on the control function approach from instrumental variable regression, we propose the boosted control function (BCF) as a target of inference and prove its ability to successfully predict even in intervened versions of the underlying SIMDG. We provide necessary and sufficient conditions for identifying the BCF and show that it is worst-case optimal. We introduce the ControlTwicing algorithm to estimate the BCF and analyze its predictive performance on simulated and real-world data.