Supervised learning is often affected by covariate shift, in which the marginal distributions of the instances (covariates $x$) in the training and testing samples, $\mathrm{p}_\text{tr}(x)$ and $\mathrm{p}_\text{te}(x)$, differ while the label conditionals coincide. Existing approaches address covariate shift either by using the ratio $\mathrm{p}_\text{te}(x)/\mathrm{p}_\text{tr}(x)$ to weight training samples (reweighted methods) or by using the ratio $\mathrm{p}_\text{tr}(x)/\mathrm{p}_\text{te}(x)$ to weight testing samples (robust methods). However, such approaches can perform poorly under support mismatch or when these ratios take large values. We propose a minimax risk classification (MRC) approach for covariate shift adaptation that avoids these limitations by weighting both training and testing samples. In addition, we develop effective techniques that obtain both sets of weights and that generalize the conventional kernel mean matching method. We provide novel generalization bounds for our method that show a significant increase in the effective sample size compared with reweighted methods. The proposed method also achieves enhanced classification performance in both synthetic and empirical experiments.
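To illustrate why large values of $\mathrm{p}_\text{te}(x)/\mathrm{p}_\text{tr}(x)$ can hurt reweighted methods, the following minimal sketch computes training weights for a toy one-dimensional problem and tracks how many samples effectively contribute. Everything in it is an assumption made for illustration: the Gaussian training and testing marginals, the helper names, and the use of Kish's formula $(\sum_i w_i)^2 / \sum_i w_i^2$ as a proxy for the effective sample size, which is not necessarily the quantity appearing in the generalization bounds. It does not implement the proposed MRC method.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def kish_effective_sample_size(weights):
    """Kish's proxy for effective sample size: (sum w)^2 / sum(w^2)."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.square(weights).sum()


# Hypothetical setup: training covariates drawn from N(0, 1).
rng = np.random.default_rng(0)
n_train = 1000
x_train = rng.normal(loc=0.0, scale=1.0, size=n_train)

# Assumed testing marginals N(shift, 1) for increasing amounts of shift.
ess_by_shift = {}
for shift in (0.0, 1.0, 2.0):
    # Reweighted methods weight each training sample by p_te(x) / p_tr(x).
    weights = gaussian_pdf(x_train, shift, 1.0) / gaussian_pdf(x_train, 0.0, 1.0)
    ess_by_shift[shift] = kish_effective_sample_size(weights)

for shift, ess in ess_by_shift.items():
    print(f"shift={shift:.1f}  effective sample size={ess:.1f} of {n_train}")
```

With no shift all weights equal one and every training sample counts fully. As the testing marginal moves away from the training marginal, a few samples receive very large weights and the proxy drops well below the nominal sample size, which is the regime the abstract identifies as problematic for reweighted methods.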