This paper presents a novel approach to Single-Positive Multi-label Learning (SPML). In general multi-label learning, a model learns to predict multiple labels or categories for a single input image. This contrasts with standard multi-class image classification, where the task is to predict a single label for an image from many possible labels. SPML specifically considers learning to predict multiple labels when the training data contains only a single positive annotation per image. Multi-label learning is in many ways a more realistic task than single-label learning, because real-world instances often belong to multiple categories simultaneously; however, most common computer vision datasets contain predominantly single labels, owing to the complexity and cost of collecting multiple high-quality annotations for each instance. Our approach, Vision-Language Pseudo-Labeling (VLPL), uses a vision-language model to suggest strong positive and negative pseudo-labels, and outperforms the current state-of-the-art (SOTA) methods by 5.5% on Pascal VOC, 18.4% on MS-COCO, 15.2% on NUS-WIDE, and 8.4% on CUB-Birds. Our code and data are available at https://github.com/mvrl/VLPL.
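As a rough illustration of the pseudo-labeling idea, the sketch below assigns positive and negative pseudo-labels by thresholding cosine similarity between image embeddings and class-name text embeddings. It is not the VLPL implementation: the function name `pseudo_labels`, the thresholds `pos_thresh` and `neg_thresh`, and the use of `-1` for "unknown" are assumptions made for this example, and the embeddings are assumed to come from some pretrained vision-language model.

```python
import numpy as np


def pseudo_labels(image_emb, text_emb, observed, pos_thresh=0.3, neg_thresh=0.1):
    """Hypothetical sketch: derive pseudo-labels from image/text similarity.

    image_emb: (N, D) image embeddings (assumed to come from a vision-language model).
    text_emb:  (C, D) embeddings of the C class names from the same model.
    observed:  (N,) index of the single annotated positive class per image.
    Returns an (N, C) integer array: 1 = positive, 0 = negative, -1 = unknown.
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    observed = np.asarray(observed, dtype=np.int64)

    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = img @ txt.T  # shape (N, C)

    labels = np.full(sim.shape, -1, dtype=np.int64)  # start as unknown
    labels[sim >= pos_thresh] = 1  # confident positives (assumed threshold)
    labels[sim <= neg_thresh] = 0  # confident negatives (assumed threshold)

    # The single annotated label is always kept as a positive.
    labels[np.arange(len(observed)), observed] = 1
    return labels


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    images = [[1.0, 0.0], [0.0, 1.0]]
    classes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(pseudo_labels(images, classes, observed=[0, 1], pos_thresh=0.5))
```

In this sketch, entries left at `-1` would be excluded from the loss or handled by whatever unknown-label strategy the training setup uses; how VLPL itself treats such entries is not specified in the abstract.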