Learning models that can handle distribution shifts is a key challenge in domain generalization. Invariance learning, an approach that focuses on identifying features invariant across environments, improves model generalization by capturing stable relationships; these relationships may represent causal effects when the data distribution is encoded within a structural equation model (SEM) and satisfies modularity conditions. This has led to a growing body of work that builds on invariance learning and leverages the inherent heterogeneity across environments to develop methods that provide causal explanations while improving robust prediction. However, in many practical scenarios, obtaining complete outcome data from each environment is difficult because data collection is costly or complex. This limitation hinders the development of models that fully leverage environmental heterogeneity, so addressing missing outcomes is crucial for improving both causal insights and robust prediction. In this work, we derive an estimator from the invariance objective under missing outcomes. We establish non-asymptotic guarantees on its variable selection property and on its $\ell_2$ error convergence rates, both of which are influenced by the proportion of missing data and the quality of the imputation models across environments. We evaluate the new estimator through extensive simulations and demonstrate its application on the UCI Bike Sharing dataset, where the task is to predict the count of bike rentals. The results show that, despite relying on a biased imputation model, the estimator is efficient and achieves lower prediction error, provided the bias is within a reasonable range.