Semi-supervised clustering is a fundamental problem that arises in many applications. Most existing methods require the ideal number of clusters to be known in advance, which is often difficult to obtain in practice. Moreover, satisfying the must-link constraints remains a major challenge for these methods. In this work, we view the semi-supervised clustering task as a partitioning problem on a graph associated with the given dataset, where the similarity matrix includes a scaling parameter that reflects the must-link constraints. Using a relaxation technique, we reformulate the graph partitioning problem as a continuous optimization model that requires only an overestimate of the cluster number rather than its exact value. We then propose a block coordinate descent algorithm to solve this model efficiently and establish its convergence. From the obtained solution, we construct clusters that are theoretically guaranteed to satisfy the must-link constraints under mild assumptions. Finally, we verify the effectiveness and efficiency of the proposed method through comprehensive numerical experiments.
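To make the idea of a similarity matrix that carries a scaling parameter for must-link constraints more concrete, the following is a minimal sketch and not the construction used in this work. It assumes a Gaussian kernel for the base similarities and assumes that each must-link pair simply has its similarity overwritten by a large value; the function name `build_similarity` and the parameters `sigma` and `scale` are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_similarity(X, must_link, sigma=1.0, scale=10.0):
    """Illustrative similarity matrix with boosted must-link entries.

    Assumptions (not taken from the paper): a Gaussian kernel defines the
    base similarity, and each must-link pair (i, j) is assigned the value
    `scale`, which should exceed every kernel value (all are at most 1).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Gaussian kernel similarities in (0, 1].
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # No self-loops in the graph.
    np.fill_diagonal(W, 0.0)
    # Strengthen the edges of must-link pairs symmetrically.
    for i, j in must_link:
        W[i, j] = scale
        W[j, i] = scale
    return W


# Two well-separated groups; points 0 and 2 are far apart but must be linked.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W = build_similarity(X, must_link=[(0, 2)])
```

In this toy example the edge between points 0 and 2 becomes the heaviest in the graph even though the points are far apart, so a partitioning method that avoids cutting heavy edges is discouraged from separating them. How the scaling parameter must be chosen for the must-link guarantee to hold is part of the analysis in this work and is not captured by this sketch.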