Contrastive Language-Image Pre-training (CLIP) has recently demonstrated success across various tasks, owing to the strong feature representations learned through image-text contrastive learning. However, the instance discrimination method used by CLIP struggles to encode the semantic structure of the training data. To address this limitation, cluster discrimination has been proposed, which alternates between cluster assignment and classification. Nevertheless, most cluster discrimination approaches define only a single pseudo-label for each image, neglecting the multi-label signals it contains. In this paper, we propose a novel Multi-Label Cluster Discrimination method, named MLCD, to enhance representation learning. In the clustering step, we first cluster the large-scale LAION-400M dataset into one million centers based on off-the-shelf embedding features. Considering that natural images frequently contain multiple visual objects or attributes, we select the multiple closest centers as auxiliary class labels. In the discrimination step, we design a novel multi-label classification loss, which separates the losses from positive classes and negative classes and alleviates ambiguity on the decision boundary. We validate the proposed multi-label cluster discrimination method with experiments on different scales of models and pre-training datasets. Experimental results show that our method achieves state-of-the-art performance on multiple downstream tasks, including linear probe, zero-shot classification, and image-text retrieval. Code and models have been released at https://github.com/deepglint/unicom.
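As a rough illustration of the two steps described above, the following minimal Python sketch assigns the closest cluster centers to an image embedding as its labels and then evaluates a multi-label loss whose positive and negative terms are computed separately. It is a sketch under stated assumptions rather than the released implementation: the function names `nearest_centers` and `multi_label_loss`, the use of cosine similarity, the toy 2-D centers, the choice of two labels per image, and the specific log-sum-exp form of the loss are all hypothetical. The abstract states only that the loss separates the positive and negative classes.

```python
import math


def nearest_centers(embedding, centers, k):
    """Return the indices of the k centers most similar to the embedding.

    Assumption: similarity is cosine similarity; the abstract only says
    that the "multiple closest centers" are selected.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = normalize(embedding)
    sims = [sum(a * b for a, b in zip(e, normalize(c))) for c in centers]
    order = sorted(range(len(centers)), key=lambda i: sims[i], reverse=True)
    return order[:k]


def multi_label_loss(logits, positives):
    """Illustrative multi-label loss with separate positive and negative terms.

    Assumed form (not taken from the abstract):
        log(1 + sum_{i in pos} exp(-s_i)) + log(1 + sum_{j in neg} exp(s_j))
    The first term pushes positive logits up, the second pushes negative
    logits down, and the two terms do not interact.
    """
    pos = set(positives)
    pos_term = math.log1p(sum(math.exp(-s) for i, s in enumerate(logits) if i in pos))
    neg_term = math.log1p(sum(math.exp(s) for i, s in enumerate(logits) if i not in pos))
    return pos_term + neg_term


# Toy data: four hypothetical cluster centers and one image embedding.
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
image_embedding = [0.9, 0.5]

# Clustering step: use the two closest centers as the image's labels.
labels = nearest_centers(image_embedding, centers, k=2)

# Discrimination step: logits that agree with the labels give a lower loss.
good_loss = multi_label_loss([5.0, -5.0, -5.0, 5.0], labels)
bad_loss = multi_label_loss([-5.0, 5.0, 5.0, -5.0], labels)
```

In this toy setting the loss is small when the logits of the assigned centers are high and all others are low, and large in the reverse case. At the scale described above (one million centers), an exhaustive similarity scan like this one would be too slow, so a practical system would need a more efficient nearest-neighbor search.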