Several disciplines, such as the social sciences, epidemiology, sentiment analysis, and market research, are interested in the distribution of classes in a population rather than in the individual labels of its members. Quantification is the supervised machine learning task concerned with obtaining accurate predictors of class prevalence, particularly in the presence of label shift. Distribution-matching (DM) approaches are one of the most important families of quantification methods proposed in the literature so far. Current DM approaches model the populations involved by means of histograms of posterior probabilities. In this paper, we argue that their application to the multiclass setting is suboptimal because the histograms become class-specific, thus missing the opportunity to model inter-class information that may exist in the data. We propose a new representation mechanism based on multivariate densities that we model via kernel density estimation (KDE). Our experiments show that our method, dubbed KDEy, yields superior quantification performance with respect to previous DM approaches. We also investigate the KDE-based representation within the maximum likelihood framework and show that KDEy often outperforms the expectation-maximization method for quantification, arguably the strongest contender in the quantification arena to date.
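To make the maximum likelihood variant more concrete, the following is a minimal sketch and not the authors' implementation. It assumes an isotropic Gaussian kernel with a single shared bandwidth, uses synthetic posterior vectors in place of a real classifier's outputs, and solves for the class prevalences with a simple fixed-point update on the mixture weights as a stand-in optimizer. All names (`kde_density`, `estimate_prevalence`, `fake_posterior`) and all parameter values are hypothetical.

```python
import math
import random


def kde_density(x, refs, bandwidth):
    """Gaussian KDE at point x: the mean of isotropic kernels centred on refs."""
    dim = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (-dim / 2.0)
    total = 0.0
    for r in refs:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, r))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total / len(refs)


def estimate_prevalence(class_refs, test_points, bandwidth=0.1, iters=200):
    """Estimate class prevalences as the weights of a mixture of per-class KDEs.

    class_refs[c] holds the posterior vectors of training items of class c.
    The weights are fitted to the test posteriors by maximising the mixture
    log-likelihood with a fixed-point update (assumed solver, for simplicity).
    """
    n_classes = len(class_refs)
    # dens[i][c] = density of test point i under the KDE of class c
    dens = [[kde_density(x, refs, bandwidth) for refs in class_refs]
            for x in test_points]
    alpha = [1.0 / n_classes] * n_classes  # start from uniform prevalences
    for _ in range(iters):
        resp_sums = [0.0] * n_classes
        for row in dens:
            mix = sum(a * p for a, p in zip(alpha, row))
            if mix <= 0.0:
                continue  # point carries no information under any class KDE
            for c in range(n_classes):
                resp_sums[c] += alpha[c] * row[c] / mix
        total = sum(resp_sums)
        alpha = [s / total for s in resp_sums]
    return alpha


def fake_posterior(label, n_classes, rng):
    """Synthetic stand-in for a classifier's posterior on an item of class `label`."""
    logits = [rng.gauss(0.0, 1.0) + (4.0 if c == label else 0.0)
              for c in range(n_classes)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


rng = random.Random(0)
n_classes = 3
# Balanced training set: 100 posterior vectors per class.
class_refs = [[fake_posterior(c, n_classes, rng) for _ in range(100)]
              for c in range(n_classes)]
# Shifted test set: prevalences differ from training (label shift).
true_prev = [0.6, 0.3, 0.1]
test_counts = [180, 90, 30]
test_points = [fake_posterior(c, n_classes, rng)
               for c, n in enumerate(test_counts) for _ in range(n)]

estimate = estimate_prevalence(class_refs, test_points)
print("true:     ", true_prev)
print("estimated:", [round(a, 3) for a in estimate])
```

Each class is represented by one density over the whole posterior vector rather than by separate per-class histograms, which is what allows the representation to capture inter-class structure. In practice the bandwidth would need to be tuned; the value 0.1 here is arbitrary.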