Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is, "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case, and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.