Statistical machine learning methods often face the challenge of limited data from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions with the target domain or are linked to it in other ways. Techniques leveraging such \emph{dataset shift} conditions are known as \emph{domain adaptation} or \emph{transfer learning}. Despite the extensive literature on dataset shift, few works address how to use the auxiliary populations efficiently to improve the accuracy of risk evaluation for a given machine learning task in the target population. In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions -- covariate, label, and concept shift -- as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators, along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains from leveraging plausible dataset shift conditions.