Improving importance estimation in covariate shift for providing accurate prediction error

In traditional machine learning, an algorithm's predictions rest on the assumption that the training and test data follow the same distribution. In real-world data this assumption often fails; for instance, the distribution of the covariates may change while the conditional distribution of the targets given the covariates remains unchanged. This situation is known as the covariate shift problem, under which standard error estimates may no longer be accurate. In this context, the importance is a measure commonly used to alleviate the influence of covariate shift on error estimation. Its main drawback is that it is not easy to compute. The Kullback-Leibler Importance Estimation Procedure (KLIEP) is a promising method for estimating the importance. Despite its good performance, it ignores target information, since it uses only the covariates to compute the importance. This paper therefore explores the potential performance improvement when target information is considered in the computation of the importance, which requires redefining the importance so that it generalizes to this setting. Besides the potential improvement in performance, including target information makes it possible to apply the approach to the real-world plankton classification problem that motivates this research, which is characterized by its high dimensionality, since considering targets rather than covariates reduces both the computation and the noise in the covariates. The impact of including target information is also explored when the importance is estimated by Logistic Regression (LR), Kernel Mean Matching (KMM), Ensemble Kernel Mean Matching (EKMM), and Kernel Density Estimation (KDE), the naive predecessor of KLIEP. The experimental results show that using target information leads to more accurate error estimation, especially for KLIEP, the most promising of these methods.
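To make the role of the importance concrete, the following is a minimal sketch of importance-weighted error estimation under covariate shift. It is not the paper's method: it assumes the importance is the ratio of the test covariate density to the training covariate density, and it assumes both densities are known Gaussians, which is exactly what is unavailable in practice and what estimators such as KLIEP, LR, KMM, EKMM and KDE are meant to approximate. The labelling rule, the classifier, the means and all function names are hypothetical choices made only for illustration.

```python
import math
import random


def normal_pdf(x, mean, std=1.0):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def importance(x, train_mean=0.0, test_mean=1.0):
    """Importance of x: test density over training density.

    Assumption: both covariate densities are known unit-variance Gaussians.
    """
    return normal_pdf(x, test_mean) / normal_pdf(x, train_mean)


def true_label(x):
    """Hypothetical labelling rule, identical in training and test."""
    return 1 if x > 0.5 else 0


def predict(x):
    """Hypothetical fixed classifier with a misplaced threshold."""
    return 1 if x > 0.0 else 0


def zero_one_loss(x):
    """1.0 if the classifier is wrong at x, 0.0 otherwise."""
    return 0.0 if predict(x) == true_label(x) else 1.0


def estimate_errors(n=20000, seed=0, train_mean=0.0, test_mean=1.0):
    """Return (unweighted, importance-weighted) error on a training sample."""
    rng = random.Random(seed)
    train_x = [rng.gauss(train_mean, 1.0) for _ in range(n)]
    losses = [zero_one_loss(x) for x in train_x]
    unweighted = sum(losses) / n
    weighted = sum(
        importance(x, train_mean, test_mean) * loss
        for x, loss in zip(train_x, losses)
    ) / n
    return unweighted, weighted


def exact_test_error(test_mean=1.0):
    """Exact test error: probability that a test covariate falls in (0, 0.5]."""
    return normal_cdf(0.5 - test_mean) - normal_cdf(0.0 - test_mean)


if __name__ == "__main__":
    unweighted, weighted = estimate_errors()
    print("unweighted training error:", unweighted)
    print("importance-weighted error:", weighted)
    print("exact test error:", exact_test_error())
```

In this toy setting the unweighted training error estimates the error under the training distribution, while the importance-weighted average targets the error under the test distribution. With real data the importance has to be estimated from samples, and the accuracy of that estimate determines how reliable the weighted error is.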
