We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, given unlabeled data from the target distribution and labeled data whose feature distribution may differ. We propose to split the labeled data into two subsets and run kernel ridge regression on each separately, obtaining a collection of candidate models from one subset and an imputation model from the other. We use the imputation model to fill in the missing labels and then select the best candidate accordingly. Our non-asymptotic excess risk bounds demonstrate that the estimator adapts effectively to both the structure of the target distribution and the covariate shift. This adaptation is quantified through a notion of effective sample size that reflects the value of the labeled source data for the target regression task. Our estimator achieves the minimax optimal error rate up to a polylogarithmic factor, and we find that using pseudo-labels for model selection does not significantly hinder performance.
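The procedure described above can be sketched at a high level with off-the-shelf kernel ridge regression. This is a minimal illustration, not the paper's exact construction: the data, the kernel, the regularization grid, and the even split are all assumptions made for the example, and the paper's actual candidate family and selection rule may differ.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Synthetic source (labeled) and target (unlabeled) data; illustrative only.
# The target covariates are drawn from a shifted distribution.
n_src, n_tgt = 200, 100
X_src = rng.uniform(-1, 1, size=(n_src, 1))
y_src = np.sin(3 * X_src[:, 0]) + 0.1 * rng.standard_normal(n_src)
X_tgt = rng.uniform(0, 1, size=(n_tgt, 1))

# Step 1: split the labeled source data into two subsets.
half = n_src // 2
X1, y1 = X_src[:half], y_src[:half]
X2, y2 = X_src[half:], y_src[half:]

# Step 2: fit candidate models on the first subset over a grid of
# regularization parameters (the candidate collection).
alphas = [10.0 ** k for k in range(-4, 1)]
candidates = [KernelRidge(kernel="rbf", alpha=a).fit(X1, y1) for a in alphas]

# Step 3: fit an imputation model on the second subset and use it to
# pseudo-label the unlabeled target covariates.
imputer = KernelRidge(kernel="rbf", alpha=1e-3).fit(X2, y2)
pseudo_y = imputer.predict(X_tgt)

# Step 4: select the candidate with the smallest mean squared error
# against the pseudo-labels on the target sample.
mse = [np.mean((m.predict(X_tgt) - pseudo_y) ** 2) for m in candidates]
best = candidates[int(np.argmin(mse))]
```

The key point the sketch mirrors is that model selection uses only pseudo-labels on target covariates, so no labeled target data is required.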