Practitioners often need to deploy prediction models in new environments where the distributions of covariates and responses have shifted. With observational data, such shifts are often driven by unobserved confounding and can change which model is best. This paper studies distribution shift in domain adaptation with unobserved confounding. We postulate a linear structural causal model to account for endogeneity and unobserved confounding, and we leverage exogenous, invariant covariate representations to correct concept shifts and improve target prediction. We propose a data-driven representation learning method that optimizes over a lower-dimensional linear subspace and a prediction model confined to that subspace. The method optimizes a non-convex objective that interpolates between predictability and stability, constrained to the Stiefel manifold, using an analog of projected gradient descent. We analyze the optimization landscape and prove that, provided sufficient regularization, nearly all local optima align with an invariant linear subspace resilient to distribution shifts. The method achieves a nearly ideal gap between target and source risk. We validate the method and theory with real-world data sets to illustrate the tradeoffs between predictability and stability.
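To make the optimization described in the abstract concrete, the following is a minimal sketch of projected gradient descent over matrices with orthonormal columns (the Stiefel manifold). It is not the paper's objective: the stability penalty `lam * ||V^T D V||_F^2`, the covariance-difference matrix `D`, the weight `lam`, the backtracking step rule, the synthetic data, and all function names are assumptions chosen only for illustration. The projection step uses the polar factor from a thin SVD, which gives the nearest matrix with orthonormal columns.

```python
import numpy as np

def project_to_stiefel(M):
    """Nearest matrix with orthonormal columns (polar factor via thin SVD)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def objective_and_grad(V, X, y, D, lam):
    """Assumed objective: source MSE within span(V) + lam * ||V^T D V||_F^2."""
    n = len(y)
    Z = X @ V
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # best linear predictor in the subspace
    resid = y - Z @ beta
    A = V.T @ D @ V  # how strongly the (assumed) shift matrix D acts on the subspace
    value = resid @ resid / n + lam * np.sum(A ** 2)
    # beta is optimal for this V, so it can be held fixed when differentiating in V
    grad = -(2.0 / n) * np.outer(X.T @ resid, beta) + 4.0 * lam * D @ V @ A
    return value, grad, beta

def fit_subspace(X, y, D, k, lam, steps=200, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    V = project_to_stiefel(rng.standard_normal((X.shape[1], k)))
    value, grad, beta = objective_and_grad(V, X, y, D, lam)
    history = [value]
    for _ in range(steps):
        t = step_size
        while t > 1e-10:  # backtracking: halve the step until the objective does not increase
            V_new = project_to_stiefel(V - t * grad)
            new_value, new_grad, new_beta = objective_and_grad(V_new, X, y, D, lam)
            if new_value <= value:
                V, value, grad, beta = V_new, new_value, new_grad, new_beta
                break
            t *= 0.5
        history.append(value)
    return V, beta, history

# Synthetic illustration only: the last two covariates change scale in the target domain.
rng = np.random.default_rng(1)
n, p, k = 500, 6, 2
X_src = rng.standard_normal((n, p))
X_tgt = rng.standard_normal((n, p)) * np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0])
y_src = X_src @ np.array([1.0, -1.0, 0.0, 0.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(n)
D = np.cov(X_src, rowvar=False) - np.cov(X_tgt, rowvar=False)  # symmetric by construction
V, beta, history = fit_subspace(X_src, y_src, D, k=k, lam=1.0)
```

In this sketch, a larger `lam` pushes the learned subspace away from directions whose covariance differs between source and target, at some cost in source fit, which mirrors the predictability versus stability tradeoff the abstract refers to.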