Consider a scenario in which the training data contain both covariates and outcomes, while the test data contain only covariates. Our primary aim is to predict the missing outcomes of the test data. To this end, we train parametric regression models under covariate shift, where the covariate distributions differ between the training and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach approximates the test-data risk by averaging the training-data losses, each weighted by an estimated ratio of the covariate densities between the training and test data. Although this yields a test-data risk minimizer, its performance relies heavily on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, its estimation errors still induce bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator of the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We derive the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function estimator is consistent, which demonstrates its robustness against potential errors in density ratio estimation. Finally, we confirm the soundness of our proposed method via simulation studies.
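As an illustration of the importance weighting baseline described in the abstract (not the proposed doubly robust estimator, whose exact form is not given here), the following is a minimal sketch. It assumes the weight is the test covariate density divided by the training covariate density, and that this ratio is known exactly; in practice it would have to be estimated. The Gaussian covariate distributions, the quadratic outcome function, the linear working model, and all function names are hypothetical choices made only for this example.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))


def weighted_least_squares(x, y, weights):
    """Fit y ~ a + b * x by minimizing mean(weights * (y - a - b * x)^2)."""
    design = np.column_stack([np.ones_like(x), x])
    root_w = np.sqrt(weights)
    # Scaling rows by sqrt(weights) turns weighted least squares into ordinary least squares.
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef


def run_demo(n=5000, seed=0):
    rng = np.random.default_rng(seed)

    def outcome(x):
        # Hypothetical true regression function; the linear model is misspecified.
        return x ** 2

    # Covariate shift: training covariates ~ N(0, 1), test covariates ~ N(1.5, 0.5^2).
    x_train = rng.normal(0.0, 1.0, n)
    x_test = rng.normal(1.5, 0.5, n)
    y_train = outcome(x_train) + rng.normal(0.0, 0.1, n)
    # Test outcomes are generated only to evaluate the fits; they are never used for training.
    y_test = outcome(x_test) + rng.normal(0.0, 0.1, n)

    # Assumption: the true density ratio p_test / p_train is known in this toy setting.
    ratio = gaussian_pdf(x_train, 1.5, 0.5) / gaussian_pdf(x_train, 0.0, 1.0)

    coef_plain = weighted_least_squares(x_train, y_train, np.ones(n))
    coef_iw = weighted_least_squares(x_train, y_train, ratio)

    def test_risk(coef):
        return float(np.mean((y_test - coef[0] - coef[1] * x_test) ** 2))

    return {
        "coef_plain": coef_plain,
        "coef_iw": coef_iw,
        "risk_plain": test_risk(coef_plain),
        "risk_iw": test_risk(coef_iw),
        "mean_ratio": float(np.mean(ratio)),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(name, value)
```

In this toy design the linear model is misspecified, so the unweighted fit targets the best line under the training distribution while the weighted fit targets the best line under the test distribution. The sketch says nothing about the bias caused by an estimated density ratio, which is the issue the abstract's doubly robust estimator is meant to address.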