Estimating the number of components is a fundamental challenge in unsupervised learning, particularly when dealing with high-dimensional data with many components or severely imbalanced component sizes. This paper addresses this challenge for classical Gaussian mixture models. The proposed estimator is simple: center the data, compute the singular values of the centered matrix, and count those above a threshold. No iterative fitting, no likelihood calculation, and no prior knowledge of the number of components are required. We prove that, under a mild separation condition on the component centers, the estimator consistently recovers the true number of components. The result holds in high-dimensional settings where the dimension can be much larger than the sample size. It also holds when the number of components grows to the smaller of the dimension and the sample size, even under severe imbalance among component sizes. Computationally, the method is extremely fast: for example, it processes ten million samples in one hundred dimensions within one minute. Extensive experimental studies confirm its accuracy in challenging settings such as high dimensionality, many components, and severe class imbalance.
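The three steps named in the abstract (center, compute singular values, count those above a threshold) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract gives neither the threshold rule nor how the count maps to the reported number of components, so `threshold` is a hypothetical user-supplied parameter and the function returns only the raw count. The function names and the synthetic demo, including its illustrative noise-level cutoff, are assumptions made here.

```python
import numpy as np


def count_large_singular_values(X, threshold):
    """Count singular values of the column-centered data above `threshold`.

    X is an (n_samples, n_features) array. `threshold` is a hypothetical
    cutoff supplied by the caller; the abstract does not specify one.
    """
    X = np.asarray(X, dtype=float)
    # Step 1: center the data by subtracting each feature's mean.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Step 2: singular values only; singular vectors are not needed.
    s = np.linalg.svd(Xc, compute_uv=False)
    # Step 3: count the singular values above the threshold.
    return int(np.sum(s > threshold))


def demo_count(seed=0):
    """Synthetic check: 3 well-separated centers, unit-variance Gaussian noise."""
    rng = np.random.default_rng(seed)
    n_per, d, k = 200, 50, 3
    centers = 10.0 * np.eye(k, d)  # rows are 10*e_1, 10*e_2, 10*e_3
    X = np.vstack([c + rng.standard_normal((n_per, d)) for c in centers])
    n = X.shape[0]
    # Illustrative cutoff (an assumption, not the paper's rule): a margin
    # above sqrt(n) + sqrt(d), the approximate largest singular value of
    # an n-by-d matrix of unit-variance Gaussian noise.
    threshold = 1.2 * (np.sqrt(n) + np.sqrt(d))
    return count_large_singular_values(X, threshold)
```

Because centering subtracts the overall mean, the centered component means of a K-component mixture span at most K − 1 directions. In the demo, with three centers, at most two singular values can stand clear of the noise level. How the paper converts the count into its final estimate is not stated in the abstract, so that step is left out of the sketch.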