This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification. The method has three stages: initialization, training, and inference. At the initialization stage, we take full advantage of the pretrained CLIP model and propose a novel approach that extends CLIP to multi-label prediction through global-local image-text similarity aggregation. Specifically, we split each image into snippets and use CLIP to generate a similarity vector for the whole image (global) and for each snippet (local). A similarity aggregator then combines the global and local similarity vectors. At the training stage, we use the aggregated similarity scores as initial pseudo labels and propose an optimization framework that trains the parameters of the classification network and refines the pseudo labels for unobserved labels. During inference, only the classification network is used to predict the labels of the input image. Extensive experiments show that our method outperforms state-of-the-art unsupervised methods on the MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets, and even achieves results comparable to those of weakly supervised classification methods.
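The initialization stage described in the abstract (one global similarity vector, one local similarity vector per snippet, then aggregation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the embeddings are made-up 2-D vectors standing in for CLIP image and text embeddings, the function names are hypothetical, and the aggregation rule (the mean of the global score and the best snippet score per label) is a simple stand-in, not necessarily the aggregator used in the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_vector(image_emb: Sequence[float],
                      text_embs: Sequence[Sequence[float]]) -> List[float]:
    """One similarity score per label for a single image or snippet embedding."""
    return [cosine(image_emb, t) for t in text_embs]


def aggregate(global_sim: Sequence[float],
              local_sims: Sequence[Sequence[float]]) -> List[float]:
    """Assumed aggregator: mean of the global score and the best snippet score."""
    if not local_sims:
        # No snippets available: fall back to the global scores alone.
        return list(global_sim)
    best_local = [max(column) for column in zip(*local_sims)]
    return [(g + l) / 2.0 for g, l in zip(global_sim, best_local)]


def initial_pseudo_labels(global_emb: Sequence[float],
                          snippet_embs: Sequence[Sequence[float]],
                          text_embs: Sequence[Sequence[float]]) -> List[float]:
    """Global and local similarity vectors combined into one score per label."""
    global_sim = similarity_vector(global_emb, text_embs)
    local_sims = [similarity_vector(s, text_embs) for s in snippet_embs]
    return aggregate(global_sim, local_sims)


# Toy embeddings (made up): label 0 points along x, label 1 along y.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
global_emb = [1.0, 0.2]                  # the whole image is dominated by label 0
snippet_embs = [[1.0, 0.0], [0.1, 1.0]]  # one snippet clearly shows label 1

global_only = similarity_vector(global_emb, text_embs)
scores = initial_pseudo_labels(global_emb, snippet_embs, text_embs)
```

In this toy case the global embedding barely responds to label 1, while one snippet responds strongly, so the aggregated score for label 1 ends up higher than its global-only score. That is the motivation for adding local similarities: a label that occupies only a small region of the image can still receive a usable initial pseudo label.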