Quantification learning is the task of estimating the target label distribution under label shift. In this paper, we first present a unifying framework, distribution feature matching (DFM), that recovers various estimators from the literature as special cases. We derive a general performance bound for DFM procedures that improves in several key aspects upon previous bounds derived in particular cases. We then extend this analysis to study the robustness of DFM procedures in the misspecified setting, where the exact label shift hypothesis is violated, in particular when the target is contaminated by an unknown distribution. These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets. We also introduce an efficient, scalable and robust version of kernel-based DFM using the Random Fourier Feature principle.
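To make the idea of feature matching concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that a DFM estimator picks mixture weights on the probability simplex so that the weighted combination of per-class source feature means is as close as possible to the target feature mean. The abstract does not spell out this form, so it should be read as an assumption. The feature map is a Random Fourier Feature approximation of a Gaussian kernel. The function names (`rff_map`, `project_simplex`, `dfm_estimate`), the projected-gradient solver, and the synthetic Gaussian data are all illustrative choices.

```python
import numpy as np

def rff_map(X, W, b):
    """Random Fourier features approximating a Gaussian kernel."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def dfm_estimate(class_means, target_mean, n_iter=2000):
    """Minimise ||M^T alpha - target_mean||^2 over the simplex
    with projected gradient descent (an assumed, simple solver)."""
    M = np.asarray(class_means)          # shape (n_classes, D)
    G = M @ M.T                          # Gram matrix of class feature means
    h = M @ target_mean
    step = 1.0 / max(np.linalg.eigvalsh(G)[-1], 1e-12)  # 1 / Lipschitz constant
    alpha = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(n_iter):
        alpha = project_simplex(alpha - step * (G @ alpha - h))
    return alpha

# Synthetic label-shift example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
true_alpha = np.array([0.2, 0.5, 0.3])

# Random Fourier feature parameters for a Gaussian kernel with bandwidth sigma.
sigma, D = 1.0, 200
W = rng.normal(scale=1.0 / sigma, size=(2, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

# Labelled source samples: 500 points per class.
class_means = [
    rff_map(c + rng.normal(size=(500, 2)), W, b).mean(axis=0) for c in centers
]

# Unlabelled target sample drawn with shifted class proportions.
labels = rng.choice(3, size=2000, p=true_alpha)
X_target = centers[labels] + rng.normal(size=(2000, 2))
target_mean = rff_map(X_target, W, b).mean(axis=0)

alpha_hat = dfm_estimate(class_means, target_mean)
print("estimated proportions:", np.round(alpha_hat, 3))
```

Once the features are computed, only the per-class and target feature means are needed, so the optimisation cost depends on the number of classes and features rather than on the sample sizes. This is one plausible reading of why a Random Fourier Feature variant scales well.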