We consider the problem of estimating the number of clusters, k, in a dataset. We propose a non-parametric approach that uses similarity graphs to construct a robust statistic capturing similarity information among observations. This graph-based statistic is applicable to datasets of any dimension, is computationally efficient to obtain, and can be paired with any clustering technique. We develop asymptotic theory to establish the selection consistency of the proposed approach. Simulation studies demonstrate that the graph-based statistic outperforms existing methods for estimating k, especially in high-dimensional settings. We illustrate its utility on an imaging dataset and an RNA-seq dataset.