Clustering, the task of grouping samples into clusters without manual annotations, remains important and challenging. Recent works have achieved excellent results on small datasets by clustering feature representations learned through self-supervised learning. However, for datasets with a large number of clusters, such as ImageNet, current methods still cannot achieve high clustering performance. In this paper, we propose Contrastive Learning-based Clustering (CLC), which uses contrastive learning to directly learn cluster assignments. We decompose the representation into two parts: one encodes the categorical information under an equipartition constraint, and the other captures the instance-wise factors. We propose a contrastive loss that uses both parts of the representation. We theoretically analyze the proposed contrastive loss and show that CLC assigns different weights to the negative samples while learning cluster assignments. Further gradient analysis shows that the larger weights tend to concentrate on the hard negative samples. The proposed loss is therefore highly expressive, which enables us to learn cluster assignments efficiently. Experimental evaluation shows that CLC achieves overall state-of-the-art or highly competitive clustering performance on multiple benchmark datasets. In particular, we achieve 53.4% accuracy on the full ImageNet dataset and outperform existing methods by a large margin (+10.2%).
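To make the equipartition constraint concrete, the sketch below shows one common way to balance soft cluster assignments: Sinkhorn-Knopp normalization, which alternately rescales columns and rows. This is an illustration under stated assumptions, not the procedure used by CLC; the abstract does not say how the constraint is enforced, and the function name `balanced_assignments` and its parameters `temperature` and `n_iters` are hypothetical.

```python
import numpy as np


def balanced_assignments(logits, temperature=1.0, n_iters=50):
    """Sinkhorn-Knopp sketch: map (N, K) logits to soft cluster assignments
    whose rows sum to 1 and whose columns each carry roughly N / K mass.

    Assumption: this is a generic balancing routine, not CLC's own method.
    """
    n, k = logits.shape
    # Shift by the global maximum so the exponential cannot overflow.
    q = np.exp((logits - logits.max()) / temperature)
    for _ in range(n_iters):
        # Column step: give every cluster the same total mass, N / K.
        q *= (n / k) / q.sum(axis=0, keepdims=True)
        # Row step: make each sample's assignment a probability vector.
        q /= q.sum(axis=1, keepdims=True)
    return q


# Hypothetical usage: 200 samples, 4 clusters, logits biased toward cluster 0.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 4))
logits[:, 0] += 2.0
q = balanced_assignments(logits)
print(q.sum(axis=0))      # per-cluster mass
print(q.sum(axis=1)[:3])  # per-sample mass for the first three rows
```

Without the column step, the bias toward cluster 0 would pull most samples into a single cluster. The alternating rescaling spreads the mass so that every cluster receives about the same share, which is what an equipartition constraint asks for.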