We present a discriminative clustering approach in which the feature representation is learned from data and can, moreover, leverage labeled data. Representation learning can give a similarity-based clustering method the ability to adapt automatically to the underlying, hidden geometric structure of the data. The proposed approach augments the DIFFRAC method with a representation learning capability, using a gradient-based stochastic training algorithm and an optimal transport algorithm with entropic regularization to perform the cluster assignment step. The resulting method is evaluated on several real datasets while varying the ratio of labeled to unlabeled data, thereby interpolating between the fully unsupervised and fully supervised regimes. The experimental results suggest that the proposed method can learn powerful feature representations even in the fully unsupervised regime, and that it can leverage even small amounts of labeled data to improve the feature representations and obtain better clusterings of complex datasets.
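To make the cluster assignment step more concrete, the following is a minimal sketch of entropy-regularized optimal transport solved with generic Sinkhorn iterations. It is not taken from the paper: the function name `sinkhorn_assign`, the parameters `epsilon` and `n_iters`, the use of a points-by-clusters score matrix, and the choice of uniform marginals (every cluster receives the same total mass) are all assumptions made for illustration.

```python
import math


def sinkhorn_assign(scores, epsilon=0.1, n_iters=200):
    """Soft cluster assignments from an n-by-k score matrix (hypothetical helper).

    Assumption: each point carries mass 1/n and each cluster receives
    mass 1/k, i.e. clusters are balanced. Higher scores mean a better fit.
    """
    n, k = len(scores), len(scores[0])

    # Gibbs kernel exp(score / epsilon); subtracting the global maximum
    # keeps the exponentials from overflowing.
    top = max(max(row) for row in scores)
    kernel = [[math.exp((s - top) / epsilon) for s in row] for row in scores]

    row_mass, col_mass = 1.0 / n, 1.0 / k
    u = [1.0] * n
    v = [1.0] * k

    # Sinkhorn iterations: alternately rescale rows and columns so that
    # the transport plan matches the prescribed marginals.
    for _ in range(n_iters):
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]

    # Transport plan diag(u) * kernel * diag(v), rescaled by n so that
    # each row is (approximately) a distribution over clusters.
    return [[n * u[i] * kernel[i][j] * v[j] for j in range(k)]
            for i in range(n)]


# Toy scores for 4 points and 2 clusters (made-up numbers).
scores = [[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.0, 2.2]]
assignments = sinkhorn_assign(scores)
labels = [max(range(len(row)), key=row.__getitem__) for row in assignments]
```

In a representation learning loop, the scores would presumably come from the current model outputs, and the resulting assignments would serve as targets for the next gradient step. Smaller values of `epsilon` push the assignments toward hard, one-hot labels, while larger values make them smoother.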