In this paper, we propose an extension of MDEClust, a memetic framework based on the Differential Evolution paradigm for unsupervised clustering, to semi-supervised Minimum Sum-of-Squares Clustering (MSSC) problems. In semi-supervised MSSC, background knowledge is available in the form of (instance-level) "must-link" and "cannot-link" constraints, which indicate that two dataset points should be assigned to the same cluster or to different clusters, respectively. The presence of such constraints makes the problem at least as hard as its unsupervised version and, as a consequence, some framework operations need to be carefully designed to handle this additional complexity: for instance, it is no longer true that each point is assigned to its nearest cluster center. As far as we know, our new framework, called S-MDEClust, represents the first memetic methodology designed to generate a (hopefully) optimal feasible solution for semi-supervised MSSC problems. Results of thorough computational experiments on a set of well-known and synthetic datasets show the effectiveness and efficiency of our proposal.
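To make the remark about nearest-center assignment concrete, the following minimal sketch brute-forces a tiny one-dimensional instance with a single cannot-link constraint. It is not part of S-MDEClust: the toy data, the helper names (`mssc_cost`, `is_feasible`, `best_feasible`, `nearest_center_labels`), and the exhaustive search are all assumptions made purely for illustration, and the search is only practical for a handful of points.

```python
from itertools import product

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroids(points, labels, k):
    """Mean of the points in each cluster (None for an empty cluster)."""
    cents = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            cents.append(None)
            continue
        dim = len(members[0])
        cents.append(tuple(sum(m[d] for m in members) / len(members)
                           for d in range(dim)))
    return cents

def mssc_cost(points, labels, k):
    """Sum of squared distances of each point to its own cluster centroid."""
    cents = centroids(points, labels, k)
    return sum(sq_dist(p, cents[l]) for p, l in zip(points, labels))

def is_feasible(labels, must_link, cannot_link):
    """Check instance-level must-link and cannot-link constraints."""
    return (all(labels[i] == labels[j] for i, j in must_link)
            and all(labels[i] != labels[j] for i, j in cannot_link))

def best_feasible(points, k, must_link, cannot_link):
    """Exhaustive search over all k^n assignments (illustration only)."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        if not is_feasible(labels, must_link, cannot_link):
            continue
        cost = mssc_cost(points, labels, k)
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best

def nearest_center_labels(points, cents):
    """Unconstrained rule: assign each point to its closest center."""
    return tuple(min(range(len(cents)), key=lambda c: sq_dist(p, cents[c]))
                 for p in points)

# Hypothetical toy instance: points 0 and 1 are close but must be separated.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
cannot_link = [(0, 1)]
cost, labels = best_feasible(points, 2, [], cannot_link)
cents = centroids(points, labels, 2)
print("best feasible labels:", labels, "cost:", round(cost, 3))
print("nearest-center labels:", nearest_center_labels(points, cents))
```

In this instance the best feasible assignment places the point at 1 in the cluster of the points at 10 and 11, even though the center of the other cluster is much closer to it. Reassigning every point to its nearest center would therefore violate the cannot-link constraint, which is why the assignment step needs dedicated handling in the semi-supervised setting.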