Statistical machine learning methods often face the challenge of limited data from the population of interest. One remedy is to leverage data from auxiliary source populations that share some conditional distributions with the target population or are linked to it in other ways. Techniques that exploit such \emph{dataset shift} conditions are known as \emph{domain adaptation} or \emph{transfer learning}. Despite the extensive literature on dataset shift, few works address how to use the auxiliary populations efficiently to improve the accuracy of risk evaluation for a given machine learning task in the target population. In this paper, we study the general problem of efficiently estimating the target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions that includes three popular conditions (covariate, label, and concept shift) as special cases. We allow the supports of the source and target populations to be partially non-overlapping. We develop efficient and multiply robust estimators, along with a straightforward specification test for these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains from leveraging plausible dataset shift conditions.