Convex clustering is a modern method that combines characteristics of hierarchical and $k$-means clustering. Although it can capture complex clustering structures hidden in data, existing convex clustering algorithms do not scale to large data sets with sample sizes greater than several thousand. Moreover, convex clustering is known to sometimes fail to produce a complete hierarchical clustering structure. This issue arises if clusters split up or if the minimum number of possible clusters is larger than the desired number of clusters. In this paper, we propose convex clustering through majorization-minimization (CCMM), an iterative algorithm that uses cluster fusions and a highly efficient updating scheme derived using diagonal majorization. Additionally, we explore different strategies to ensure that the hierarchical clustering structure terminates in a single cluster. On a current desktop computer, CCMM efficiently solves convex clustering problems featuring over one million objects in seven-dimensional space, with an average solution time of 51 seconds.
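To illustrate the majorization-minimization idea named in the abstract, the following is a minimal sketch and not the CCMM algorithm itself. It assumes the commonly used convex clustering objective $\frac{1}{2}\|X - A\|_F^2 + \lambda \sum_{i<j} w_{ij}\|a_i - a_j\|$, which the abstract does not state, and replaces each norm by the smoothed version $\sqrt{\|a_i - a_j\|^2 + \varepsilon}$ so that a quadratic majorizer is always defined. Each iteration minimizes that majorizer exactly by solving a dense linear system, which costs $O(n^3)$ and is therefore only suitable for small examples; the diagonal majorization and cluster fusions that make CCMM scalable are not reproduced here. The function names, the weight matrix `W`, and the parameters `lam` and `eps` are hypothetical choices made for this illustration.

```python
import numpy as np


def smoothed_objective(X, A, W, lam, eps=1e-6):
    """Smoothed convex clustering loss (assumed form, see lead-in)."""
    diff = A[:, None, :] - A[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2) + eps)
    iu = np.triu_indices(X.shape[0], k=1)  # each pair i < j counted once
    return 0.5 * ((X - A) ** 2).sum() + lam * (W[iu] * dist[iu]).sum()


def mm_convex_clustering(X, W, lam, n_iter=50, eps=1e-6):
    """Generic MM iterations; W must be symmetric with a zero diagonal."""
    n = X.shape[0]
    A = X.copy()
    history = [smoothed_objective(X, A, W, lam, eps)]
    for _ in range(n_iter):
        # Majorize sqrt(u) at the current iterate by its tangent line in u,
        # which turns every penalty term into a weighted squared distance.
        diff = A[:, None, :] - A[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2) + eps)
        C = W / dist
        np.fill_diagonal(C, 0.0)
        # Weighted graph Laplacian of the reweighted penalty.
        L = np.diag(C.sum(axis=1)) - C
        # The quadratic majorizer is minimized by (I + lam * L) A = X.
        A = np.linalg.solve(np.eye(n) + lam * L, X)
        history.append(smoothed_objective(X, A, W, lam, eps))
    return A, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two small, well-separated groups in two dimensions (toy data).
    X = np.vstack([rng.normal(0.0, 0.1, (3, 2)), rng.normal(3.0, 0.1, (3, 2))])
    W = np.ones((6, 6)) - np.eye(6)  # uniform weights, assumed for illustration
    for lam in (0.01, 0.1, 1.0):
        A, history = mm_convex_clustering(X, W, lam)
        print(lam, np.round(A, 3).tolist(), history[-1])
```

Because every update minimizes a function that lies above the smoothed loss and touches it at the current iterate, the recorded loss values cannot increase from one iteration to the next. Increasing `lam` pulls the rows of `A` together, which is the mechanism behind the hierarchical clustering path described above.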