Due to their conceptual simplicity, k-means algorithm variants have been extensively used for unsupervised cluster analysis. However, one main shortcoming of these algorithms is that they essentially fit a mixture of identical spherical Gaussians to data that often deviates vastly from such a distribution. In comparison, general Gaussian Mixture Models (GMMs) can fit richer structures but require estimating a number of parameters per cluster that is quadratic in the data dimension to represent the covariance matrices. This poses two main issues: (i) the underlying optimization problems are challenging due to their larger number of local minima, and (ii) their solutions can overfit the data. In this work, we design search strategies that circumvent both issues. We develop more effective optimization algorithms for general GMMs, and we combine these algorithms with regularization strategies that avoid overfitting. Through extensive computational analyses, we observe that optimization or regularization in isolation does not substantially improve cluster recovery. However, combining these techniques achieves a level of performance previously unattained by k-means algorithm variants, uncovering vastly different cluster structures. These results shed new light on the status quo between GMM and k-means methods and suggest the more frequent use of general GMMs for data exploration. To facilitate such applications, we provide open-source code as well as Julia packages (UnsupervisedClustering.jl and RegularizedCovarianceMatrices.jl) implementing the proposed techniques.
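As a rough illustration of the two issues named above, the following Python sketch counts the free parameters of a single full covariance matrix and shows how a simple regularizer can stabilize its estimate. It is not the method proposed in this work, nor the API of the Julia packages: the function names `covariance_parameter_count` and `shrunk_covariance`, the shrinkage weight `alpha`, and the choice of shrinking toward a scaled identity are all assumptions made for illustration.

```python
import numpy as np


def covariance_parameter_count(d: int) -> int:
    """Free parameters in one symmetric d x d covariance matrix."""
    return d * (d + 1) // 2


def shrunk_covariance(X: np.ndarray, alpha: float) -> np.ndarray:
    """Blend the empirical covariance with a scaled identity.

    Hypothetical helper: a generic shrinkage estimator, assumed here
    only to illustrate what regularizing a covariance matrix can mean.
    """
    d = X.shape[1]
    S = np.cov(X, rowvar=False, bias=True)  # empirical (maximum-likelihood) covariance
    mu = np.trace(S) / d                    # average variance across dimensions
    return (1.0 - alpha) * S + alpha * mu * np.eye(d)


# Fewer samples than dimensions: the empirical covariance is rank deficient,
# which is one way an unregularized Gaussian component can overfit.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))

S = np.cov(X, rowvar=False, bias=True)
R = shrunk_covariance(X, alpha=0.1)

print("parameters per cluster (d = 10):", covariance_parameter_count(10))
print("rank of empirical covariance:", np.linalg.matrix_rank(S))
print("smallest eigenvalue after shrinkage:", np.linalg.eigvalsh(R).min())
```

With five samples in ten dimensions the empirical covariance cannot have full rank, whereas the shrunk estimate is positive definite for any `alpha` in (0, 1] as long as the data has nonzero variance. A spherical model such as the one implicit in k-means sidesteps the problem by using a single variance parameter, at the cost of the richer structure a full covariance can represent.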