Pseudo labeling is a popular and effective way to exploit the information in unlabeled data. Conventional instance-aware pseudo labeling methods often assign each unlabeled instance a pseudo label based on its predicted probabilities. However, because the number of true labels for each instance is unknown, these methods do not generalize well to semi-supervised multi-label learning (SSMLL) scenarios: they risk either introducing false positive labels or neglecting true positive ones. In this paper, we propose to solve SSMLL problems with Class-distribution-Aware Pseudo labeling (CAP), which encourages the class distribution of the pseudo labels to approximate the true one. Specifically, we design a regularized learning framework that uses class-aware thresholds to control the number of pseudo labels assigned to each class. Given that the labeled and unlabeled examples are sampled from the same distribution, we determine the thresholds from the empirical class distribution, which can be treated as a tight approximation to the true one. Theoretically, we show that the generalization performance of the proposed method depends on the pseudo labeling error, which the CAP strategy can significantly reduce. Extensive experiments on multiple benchmark datasets validate that CAP can effectively solve SSMLL problems.
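To make the idea of class-aware thresholds concrete, the following is a minimal sketch rather than the exact procedure of the proposed method. It assumes that the per-class positive rate is estimated from the labeled set and that the threshold for each class is chosen so that roughly the same fraction of unlabeled instances receives that class as a pseudo label. The function names `class_aware_thresholds` and `assign_pseudo_labels` are hypothetical, and the regularized learning framework itself is not modeled here.

```python
import numpy as np


def class_aware_thresholds(labeled_targets, unlabeled_probs):
    """Pick one threshold per class from the empirical class distribution.

    labeled_targets: (n_labeled, K) binary matrix of ground-truth labels.
    unlabeled_probs: (n_unlabeled, K) predicted probabilities.
    Assumption: labeled and unlabeled data share the same class distribution.
    """
    # Empirical fraction of positive instances for each class.
    class_prior = labeled_targets.mean(axis=0)
    n_unlabeled, num_classes = unlabeled_probs.shape
    thresholds = np.empty(num_classes)
    for k in range(num_classes):
        # Target number of positive pseudo labels for class k.
        num_pos = int(np.rint(class_prior[k] * n_unlabeled))
        sorted_desc = np.sort(unlabeled_probs[:, k])[::-1]
        # Threshold at the num_pos-th most confident prediction;
        # if no positives are expected, no instance can pass.
        thresholds[k] = sorted_desc[num_pos - 1] if num_pos > 0 else np.inf
    return thresholds


def assign_pseudo_labels(unlabeled_probs, thresholds):
    """Mark class k positive wherever its probability reaches its threshold.

    Tied probabilities at the threshold may yield slightly more positives
    than the target count.
    """
    return (unlabeled_probs >= thresholds).astype(int)
```

In contrast to a single global threshold or a fixed top-k rule per instance, this sketch lets the number of pseudo labels vary across instances while keeping the per-class proportion of positives close to the one observed in the labeled data.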