In this paper, we address semi-supervised Minimum Sum-of-Squares Clustering (MSSC) problems in which background knowledge is given in the form of instance-level constraints. In particular, we consider "must-link" and "cannot-link" constraints, which indicate whether two data points must be assigned to the same cluster or to different clusters, respectively. The presence of such constraints makes the problem at least as hard as its unsupervised version: it is no longer true that each point is assigned to its nearest cluster center, so crucial operations such as the assignment step must be modified. In this scenario, we propose a novel memetic strategy based on the Differential Evolution paradigm, directly extending a state-of-the-art framework recently proposed in the unsupervised clustering literature. As far as we know, our contribution represents the first attempt to define a memetic methodology designed to generate a (hopefully) optimal feasible solution for the semi-supervised MSSC problem. The proposal is compared with some state-of-the-art algorithms from the literature on a set of well-known datasets, highlighting its effectiveness and efficiency in finding good-quality clustering solutions.
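To illustrate why the constraints change the assignment step, the following minimal sketch compares a plain nearest-center assignment with a simple greedy constrained assignment on a toy instance. It is not the memetic method proposed in this paper: the greedy rule (in the spirit of COP-k-means), the function names, and the toy data are all assumptions introduced only for illustration.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((x - y) ** 2 for x, y in zip(p, c))

def partner(i, pair):
    """Return the other endpoint of a constraint pair involving i, else None."""
    a, b = pair
    if a == i:
        return b
    if b == i:
        return a
    return None

def is_feasible(i, j, labels, must_link, cannot_link):
    """Check cluster j for point i against the points labelled so far."""
    for pair in must_link:
        k = partner(i, pair)
        if k is not None and labels[k] is not None and labels[k] != j:
            return False
    for pair in cannot_link:
        k = partner(i, pair)
        if k is not None and labels[k] == j:
            return False
    return True

def nearest_assignment(points, centers):
    """Unconstrained step: every point goes to its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def greedy_constrained_assignment(points, centers, must_link, cannot_link):
    """Illustrative greedy step: nearest center that violates no constraint."""
    labels = [None] * len(points)
    for i, p in enumerate(points):
        by_distance = sorted(range(len(centers)),
                             key=lambda j: sq_dist(p, centers[j]))
        for j in by_distance:
            if is_feasible(i, j, labels, must_link, cannot_link):
                labels[i] = j
                break
        else:
            # A greedy pass can fail even when a feasible clustering exists.
            raise ValueError(f"no feasible cluster for point {i}")
    return labels

def sum_of_squares(points, labels, centers):
    """MSSC objective for fixed centers and a given assignment."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

# Hypothetical toy instance: four points on a line, two fixed centers.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
centers = [(0.5, 0.0), (5.5, 0.0)]
must_link = [(2, 3)]
cannot_link = [(0, 1)]

free = nearest_assignment(points, centers)
constrained = greedy_constrained_assignment(points, centers,
                                            must_link, cannot_link)
print(free, sum_of_squares(points, free, centers))
print(constrained, sum_of_squares(points, constrained, centers))
```

On this toy instance the nearest-center assignment places points 0 and 1 in the same cluster, violating the cannot-link constraint, with an objective value of 1.0. The greedy constrained assignment moves point 1 to the farther center and the objective rises to 21.0, which shows that a feasible assignment need not be the nearest-center one. The greedy pass depends on the order in which points are visited and may fail to find a feasible assignment even when one exists.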