Clustering is a well-known unsupervised machine learning approach that automatically groups discrete sets of instances with similar characteristics. Constrained clustering is a semi-supervised extension of this process that can be used when expert knowledge is available to supply constraints that the algorithm can exploit. Well-known examples of such constraints are must-link (two instances belong to the same group) and cannot-link (two instances definitely do not belong to the same group). The research area of constrained clustering has grown significantly over the years, with a large variety of new algorithms and more advanced types of constraints being proposed. However, no unifying overview is available that makes the wide variety of methods, constraints and benchmarks easy to understand. To remedy this, this study presents the background of constrained clustering in detail and provides a novel ranked taxonomy of the types of constraints that can be used in constrained clustering. In addition, it focuses on instance-level pairwise constraints and gives an overview of their applications and historical context. Finally, it presents a statistical analysis covering 307 constrained clustering methods, categorizes them according to their features, and provides a ranking score indicating which methods have the most potential based on their popularity and validation quality. Based on this analysis, potential pitfalls and future research directions are provided.
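To make the two pairwise constraint types concrete, the following minimal sketch checks a given cluster assignment against a set of must-link and cannot-link pairs. It is an illustration only and not a method from the study: the function name `violated_constraints`, the instance identifiers, and the toy labels are all hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def violated_constraints(
    labels: Dict[Hashable, int],
    must_link: Iterable[Pair],
    cannot_link: Iterable[Pair],
) -> List[Tuple[str, Hashable, Hashable]]:
    """Return every pairwise constraint that the assignment `labels` breaks.

    Hypothetical helper for illustration; `labels` maps each instance
    to the index of the cluster it was assigned to.
    """
    violations = []
    # A must-link pair is violated when the two instances land in different clusters.
    for a, b in must_link:
        if labels[a] != labels[b]:
            violations.append(("must-link", a, b))
    # A cannot-link pair is violated when the two instances share a cluster.
    for a, b in cannot_link:
        if labels[a] == labels[b]:
            violations.append(("cannot-link", a, b))
    return violations


# Toy assignment of four instances to two clusters (made-up data).
labels = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
must_link = [("x1", "x2"), ("x2", "x3")]
cannot_link = [("x1", "x4"), ("x3", "x4")]

print(violated_constraints(labels, must_link, cannot_link))
```

In this toy assignment the must-link pair (x2, x3) is split across clusters and the cannot-link pair (x3, x4) shares a cluster, so both are reported. A constrained clustering algorithm would use such pairs while forming the clusters, either forbidding violations outright or penalizing them.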