Functions of the density ratio $p/q$ are widely used in machine learning to quantify the discrepancy between two distributions $p$ and $q$. For high-dimensional distributions, binary classification-based density ratio estimators have shown great promise. However, when the densities are well separated, estimating the density ratio with a binary classifier is challenging. In this work, we show that state-of-the-art density ratio estimators perform poorly in well-separated cases and demonstrate that this is due to distribution shifts between training and evaluation time. We present an alternative method that leverages multi-class classification for density ratio estimation and does not suffer from distribution shift issues. The method uses a set of auxiliary densities $\{m_k\}_{k=1}^K$ and trains a multi-class logistic regression to classify samples from $p$, $q$, and $\{m_k\}_{k=1}^K$ into $K+2$ classes. We show that if these auxiliary densities are constructed such that they overlap with $p$ and $q$, then the multi-class logistic regression allows for estimating $\log p/q$ on the domain of any of the $K+2$ distributions and resolves the distribution shift problems of the current state-of-the-art methods. We compare our method to state-of-the-art density ratio estimators on both synthetic and real datasets and demonstrate its superior performance on the tasks of density ratio estimation, mutual information estimation, and representation learning. Code: https://www.blackswhan.com/mdre/
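The classification idea can be illustrated with a minimal sketch; this is not the authors' implementation. It rests on a standard fact: with equal class sizes, the Bayes-optimal posterior for class $c$ is proportional to that class's density, so the difference between the softmax logits of the $p$ class and the $q$ class estimates $\log p(x)/q(x)$. Everything concrete below is an assumption made for illustration: the 1-D Gaussians chosen for $p$, $q$, and a single auxiliary density $m$ ($K=1$), the quadratic feature map, the plain gradient-descent loop, and the helper names `raw_features`, `design`, and `log_ratio`. The toy densities are also only mildly separated, to keep the optimization simple.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # samples per class; equal sizes, so class priors cancel in the ratio

# Toy 1-D setup (illustrative assumptions, not taken from the paper).
x_p = rng.normal(-1.0, 1.0, n)  # p = N(-1, 1)
x_q = rng.normal(+1.0, 1.0, n)  # q = N(+1, 1)
x_m = rng.normal(0.0, 2.0, n)   # m = N(0, 2^2), one auxiliary density overlapping p and q

x = np.concatenate([x_p, x_q, x_m])
y = np.repeat(np.arange(3), n)  # class 0 = p, class 1 = q, class 2 = m


def raw_features(x):
    """Quadratic features; Gaussian log-densities are quadratic in x."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x ** 2], axis=1)


# Standardize features with training statistics so gradient descent is well conditioned.
phi = raw_features(x)
mu, sd = phi.mean(axis=0), phi.std(axis=0)


def design(x):
    """Bias column plus standardized quadratic features."""
    z = (raw_features(x) - mu) / sd
    return np.hstack([np.ones((z.shape[0], 1)), z])


X = design(x)
Y = np.eye(3)[y]      # one-hot labels
W = np.zeros((3, 3))  # (features, classes)

# Full-batch gradient descent on the multi-class cross-entropy loss.
for _ in range(3000):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    W -= 1.0 * X.T @ (probs - Y) / len(x)


def log_ratio(x):
    """Estimate log p(x)/q(x) as the logit of class p minus the logit of class q."""
    h = design(np.atleast_1d(x)) @ W
    return h[:, 0] - h[:, 1]
```

For these particular Gaussians the analytic log ratio is $-2x$, so `log_ratio(np.linspace(-3, 3, 7))` should roughly follow that line. Because the auxiliary class places training samples across the region between and around $p$ and $q$, the classifier is evaluated where it has seen data, which is the intuition the abstract describes.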