Quantification learning deals with the task of estimating the target label distribution under label shift. In this paper, we first present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derive a general performance bound for DFM procedures, improving in several key aspects upon previous bounds derived in particular cases. We then extend this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution. These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets. We also introduce an efficient, scalable and robust version of kernel-based DFM using the Random Fourier Feature principle.
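The abstract does not spell out the DFM estimator, so the following is only a minimal sketch under stated assumptions. It assumes that DFM estimates the target class proportions by finding the point of the probability simplex whose mixture of class-wise source mean embeddings is closest, in Euclidean norm, to the target mean embedding. It also assumes that the feature map is a Random Fourier Feature approximation of a Gaussian kernel. All function names and parameters (`rff_map`, `project_simplex`, `dfm_estimate`, `n_features`, `sigma`) are hypothetical, and the projected gradient solver is a convenience choice, not necessarily the procedure used in the paper.

```python
import numpy as np


def rff_map(X, W, b):
    """Random Fourier Features approximating a Gaussian kernel (assumed feature map)."""
    n_features = W.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def dfm_estimate(X_src, y_src, X_tgt, n_classes,
                 n_features=200, sigma=1.0, n_iter=500, seed=0):
    """Estimate target class proportions by matching mean feature embeddings."""
    rng = np.random.default_rng(seed)
    d = X_src.shape[1]
    # Frequencies and phases of the random features, shared by source and target.
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    # Columns of M: empirical mean embedding of each source class.
    M = np.stack(
        [rff_map(X_src[y_src == c], W, b).mean(axis=0) for c in range(n_classes)],
        axis=1,
    )
    # Empirical mean embedding of the unlabeled target sample.
    q = rff_map(X_tgt, W, b).mean(axis=0)

    # Minimize 0.5 * ||M @ alpha - q||^2 over the simplex by projected gradient descent.
    G = M.T @ M
    h = M.T @ q
    step = 1.0 / np.linalg.eigvalsh(G)[-1]  # 1 / Lipschitz constant of the gradient
    alpha = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iter):
        alpha = project_simplex(alpha - step * (G @ alpha - h))
    return alpha
```

Calling `dfm_estimate(X_src, y_src, X_tgt, n_classes)` with a labeled source sample and an unlabeled target sample returns a non-negative vector that sums to one. Because only `n_features`-dimensional means are stored, the cost after the feature pass does not depend on the sample sizes, which is the scalability argument usually made for random-feature approximations of kernel methods.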