Making Binary Classification from Multiple Unlabeled Datasets Almost Free of Supervision

Training a classifier on a large amount of supervised data is expensive, or even prohibitive, when the labeling cost is high. A notable advance in learning from weaker forms of supervision is binary classification from multiple unlabeled datasets, which requires knowledge of the exact class priors of all unlabeled datasets. However, requiring these class priors is restrictive in many real-world scenarios. To address this issue, we propose a new problem setting: binary classification from multiple unlabeled datasets with only one pairwise numerical relationship of class priors (MU-OPPO). Here, the only supervision is the relative order of the class-prior probabilities of two datasets among the multiple unlabeled datasets, i.e., which of the two has the higher proportion of positive examples. In MU-OPPO, we do not need the class priors of all unlabeled datasets; we only require that there exists one pair of unlabeled datasets for which we know which has the larger class prior. This form of supervision is clearly easier to obtain, which can make labeling almost free. We propose a novel framework for the MU-OPPO problem that consists of four sequential modules: (i) pseudo label assignment; (ii) confident example collection; (iii) class prior estimation; (iv) classifier training with estimated class priors. Theoretically, we analyze the gap between the estimated and true class priors under the proposed framework. Empirically, comprehensive experiments confirm the superiority of our framework: it yields smaller class-prior estimation errors and better binary classification performance.
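
To make the supervision signal in MU-OPPO concrete, the following is a minimal sketch of a toy problem instance. It illustrates only the problem setting, not the proposed four-module framework. The function name `make_mu_oppo_instance`, the one-dimensional Gaussian features, and the example priors are all assumptions chosen for illustration. The hidden labels and priors are returned only so the instance can be inspected; a learner in this setting would see just the unlabeled sets and the single ordered pair.

```python
import random


def make_mu_oppo_instance(priors, n_per_set, seed=0):
    """Build a toy MU-OPPO instance (hypothetical helper, for illustration only).

    Returns (unlabeled_sets, hidden_labels, known_pair), where
    known_pair == (i, j) means "set i has a larger class prior than set j".
    """
    # The single pairwise relation is taken between the first two sets,
    # so their priors must differ for the order to be well defined.
    assert len(priors) >= 2 and priors[0] != priors[1]
    rng = random.Random(seed)
    unlabeled_sets, hidden_labels = [], []
    for prior in priors:
        # Hidden labels: +1 with probability `prior`, otherwise -1.
        labels = [1 if rng.random() < prior else -1 for _ in range(n_per_set)]
        # Toy 1-D features: positives centred at +1, negatives at -1.
        points = [rng.gauss(float(y), 1.0) for y in labels]
        unlabeled_sets.append(points)
        hidden_labels.append(labels)
    # The only supervision: which of the two chosen sets has the larger prior.
    known_pair = (0, 1) if priors[0] > priors[1] else (1, 0)
    return unlabeled_sets, hidden_labels, known_pair


# Four unlabeled sets; the priors themselves are never shown to the learner.
sets, hidden, pair = make_mu_oppo_instance([0.8, 0.3, 0.6, 0.1], n_per_set=1000)
```

In this sketch the learner would receive `sets` and `pair` only, which is far less information than the exact class priors required by the earlier multiple-unlabeled-datasets setting.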