Clustering ensemble has become a popular research topic in data science because it can improve the robustness of a single clustering method. Many clustering ensemble methods have been proposed, most of which fall into two categories: clustering-view methods and sample-view methods. Clustering-view methods are generally efficient, but they can be affected by unreliable base clustering results. Sample-view methods perform well, but constructing the pairwise sample relation is time-consuming. In this paper, the clustering ensemble is formulated as a k-HyperEdge Medoids discovery problem, and a clustering ensemble method based on k-HyperEdge Medoids is proposed that combines the characteristics of both types of methods. The method first selects a set of hyperedges efficiently from the clustering view, then diffuses and adjusts these hyperedges from the sample view, guided by a hyperedge loss function, to construct an effective k-HyperEdge Medoid set. The loss is reduced mainly by assigning each sample to the hyperedge to which it has the highest degree of belonging. Theoretical analyses show that the solution can approximate the optimum, that the assignment method gradually reduces the loss function, and that the estimate of the belonging degree is statistically reasonable. Experiments on artificial data illustrate the working mechanism of the proposed method. Its convergence is verified experimentally on twenty data sets, and its effectiveness and efficiency are verified on the same data sets against nine representative clustering ensemble algorithms.
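The select-then-assign idea described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define the selection rule, the loss function, or the belonging degree, so all three are stand-ins chosen for illustration. Here each cluster of each base clustering is treated as a hyperedge, the hypothetical `select_initial` greedily picks k large hyperedges with little overlap, and the assumed belonging degree of a sample to a medoid is the average, over base clusterings, of the fraction of that medoid's members that share a cluster with the sample. All function names are hypothetical.

```python
import numpy as np

def build_hyperedges(base_labels):
    """Turn every cluster of every base clustering into one hyperedge
    (a boolean membership mask over the samples)."""
    edges = []
    for labels in base_labels:
        for c in np.unique(labels):
            edges.append(labels == c)
    return np.array(edges)  # shape: (n_edges, n_samples)

def select_initial(edges, k):
    """Assumed selection rule (not from the paper): start with the largest
    hyperedge, then greedily add the one covering the most uncovered samples."""
    chosen = [int(np.argmax(edges.sum(axis=1)))]
    covered = edges[chosen[0]].copy()
    while len(chosen) < k:
        gain = (edges & ~covered).sum(axis=1)
        gain[chosen] = -1  # never pick the same hyperedge twice
        nxt = int(np.argmax(gain))
        chosen.append(nxt)
        covered |= edges[nxt]
    return edges[chosen].copy()

def k_hyperedge_medoids_sketch(base_labels, k, max_iter=50):
    """Iteratively assign each sample to the medoid hyperedge with the
    highest (assumed) belonging degree until the assignment is stable."""
    base_labels = [np.asarray(l) for l in base_labels]
    m = len(base_labels)
    H = build_hyperedges(base_labels).astype(float)            # (E, n)
    M = select_initial(H.astype(bool), k).astype(float)        # (k, n)
    labels = np.full(H.shape[1], -1)
    for _ in range(max_iter):
        sizes = np.maximum(M.sum(axis=1), 1.0)                 # guard empty medoids
        overlap = H @ M.T / sizes                              # (E, k): share of medoid j inside edge e
        belong = H.T @ overlap / m                             # (n, k): assumed belonging degree
        new_labels = belong.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Rebuild the medoid hyperedges from the current assignment.
        M = (labels[None, :] == np.arange(k)[:, None]).astype(float)
    return labels

# Three base clusterings of six samples; the second one is slightly unreliable.
base = [[0, 0, 0, 1, 1, 1],
        [0, 0, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0]]
consensus = k_hyperedge_medoids_sketch(base, k=2)
```

The sketch works only with the hyperedge-by-sample incidence matrix and never builds an n-by-n pairwise sample relation, which is the efficiency concern the abstract raises about sample-view methods. It assumes k does not exceed the number of available hyperedges.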