This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification, comprising three stages: initialization, training, and inference. At the initialization stage, we build on the pretrained CLIP model and propose a novel approach that extends CLIP to multi-label prediction through global-local image-text similarity aggregation. Specifically, we split each image into snippets and use CLIP to generate a similarity vector for the whole image (global) as well as for each snippet (local). A similarity aggregator then combines the global and local similarity vectors. At the training stage, the aggregated similarity scores serve as the initial pseudo labels, and we propose an optimization framework that trains the parameters of the classification network and refines the pseudo labels for unobserved labels. During inference, only the classification network is used to predict the labels of the input image. Extensive experiments show that our method outperforms state-of-the-art unsupervised methods on the MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets and even achieves results comparable to those of weakly supervised classification methods.
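The initialization stage described above can be illustrated with a minimal sketch. The abstract does not specify the snippet layout, the similarity normalization, or the aggregation rule, so everything below is an assumption made for illustration: snippets are a non-overlapping grid, similarities are a softmax over cosine scores, and the hypothetical `aggregate` function keeps the per-class maximum over snippets when it clears a threshold (otherwise the minimum) and averages the result with the global vector. Random vectors stand in for CLIP image and text embeddings; no real CLIP call is made.

```python
import numpy as np

def split_into_snippets(image, grid=3):
    """Split an H x W x C array into grid*grid non-overlapping snippets (assumed layout)."""
    h, w = image.shape[:2]
    hs, ws = h // grid, w // grid
    return [image[r * hs:(r + 1) * hs, c * ws:(c + 1) * ws]
            for r in range(grid) for c in range(grid)]

def similarity_vector(image_emb, text_embs, temperature=0.01):
    """Softmax over classes of the cosine similarity between one image
    embedding (D,) and the class text embeddings (K, D)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def aggregate(global_sim, local_sims, threshold=0.5):
    """Hypothetical aggregator: per class, keep the strongest snippet score if it
    reaches `threshold`, otherwise the weakest; then average with the global score."""
    local_sims = np.asarray(local_sims)
    strongest = local_sims.max(axis=0)
    weakest = local_sims.min(axis=0)
    local = np.where(strongest >= threshold, strongest, weakest)
    return 0.5 * (np.asarray(global_sim) + local)

# Toy run: random vectors stand in for CLIP embeddings (assumption, not real CLIP output).
rng = np.random.default_rng(0)
num_classes, dim = 5, 16
text_embs = rng.normal(size=(num_classes, dim))       # one embedding per class prompt
image = rng.random((224, 224, 3))                     # placeholder image
snippets = split_into_snippets(image, grid=3)
global_emb = rng.normal(size=dim)                     # stand-in for the whole-image embedding
snippet_embs = rng.normal(size=(len(snippets), dim))  # stand-ins for snippet embeddings

global_sim = similarity_vector(global_emb, text_embs)
local_sims = [similarity_vector(e, text_embs) for e in snippet_embs]
pseudo_labels = aggregate(global_sim, local_sims)     # initial soft pseudo labels, shape (K,)
print(pseudo_labels.shape)
```

Under these assumptions, a class that dominates at least one snippet receives a high local score even when the whole-image similarity is diluted by other objects, which is the intuition behind combining global and local views for multi-label prediction.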