Supervised learning is often affected by covariate shift, in which the marginal distributions of instances (covariates $x$) in the training and testing samples, $\mathrm{p}_\text{tr}(x)$ and $\mathrm{p}_\text{te}(x)$, differ while the label conditionals coincide. Existing approaches address covariate shift either by using the ratio $\mathrm{p}_\text{te}(x)/\mathrm{p}_\text{tr}(x)$ to weight training samples (reweighted methods) or by using the ratio $\mathrm{p}_\text{tr}(x)/\mathrm{p}_\text{te}(x)$ to weight testing samples (robust methods). However, such approaches can perform poorly under support mismatch or when these ratios take large values. We propose a minimax risk classification (MRC) approach for covariate shift adaptation that avoids these limitations by weighting both training and testing samples. In addition, we develop effective techniques that obtain both sets of weights and generalize the conventional kernel mean matching method. We provide novel generalization bounds for our method that show a significant increase in the effective sample size compared with reweighted methods. The proposed method also achieves enhanced classification performance in both synthetic and empirical experiments.
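To make the reweighted baseline concrete, the following minimal sketch computes the ratio $\mathrm{p}_\text{te}(x)/\mathrm{p}_\text{tr}(x)$ on training samples and the resulting effective sample size. It is an illustration only and not the proposed MRC method. The one-dimensional Gaussian densities, their parameters, the helper names, and the use of the Kish formula $(\sum_i w_i)^2/\sum_i w_i^2$ as the effective sample size are all assumptions made for this example; in practice the densities are unknown and the ratio must be estimated, for instance by kernel mean matching.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, written out to avoid extra dependencies."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))


def importance_weights(x_train, mean_tr, std_tr, mean_te, std_te):
    """Ratio p_te(x) / p_tr(x) at each training point (densities assumed known)."""
    return gaussian_pdf(x_train, mean_te, std_te) / gaussian_pdf(x_train, mean_tr, std_tr)


def kish_effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2 (assumed definition)."""
    return weights.sum() ** 2 / (weights**2).sum()


# Hypothetical setup: training covariates ~ N(0, 1), testing covariates ~ N(1, 1).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=5000)

w = importance_weights(x_train, mean_tr=0.0, std_tr=1.0, mean_te=1.0, std_te=1.0)

# Self-normalized reweighted estimate of the test mean of x from training data only.
reweighted_mean = np.sum(w * x_train) / np.sum(w)
ess = kish_effective_sample_size(w)

print(f"largest weight:        {w.max():.2f}")
print(f"reweighted mean of x:  {reweighted_mean:.3f}")
print(f"effective sample size: {ess:.0f} of {len(x_train)}")
```

Even with this mild shift of one standard deviation, a few training points receive large weights and the effective sample size falls well below the number of training samples. This is the limitation of reweighted methods that the abstract refers to.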