We provide necessary and sufficient conditions for the uniqueness of the set of k-means of a probability distribution. This uniqueness problem is tied to the choice of k: depending on the underlying distribution, some values of k admit multiple sets of k-means, which hampers the interpretation of the results and the stability of the algorithms. We establish a general consistency result for the empirical k-means, adapted to the setting of non-uniqueness, and determine the asymptotic distribution of the within-cluster sum of squares (WCSS). We also characterize k-means uniqueness statistically in terms of the asymptotic behavior of the empirical WCSS and, as a consequence, derive a bootstrap test for uniqueness of the set of k-means. The results are illustrated with examples of different types of non-uniqueness, and simulations are used to check the performance of the proposed methodology.
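To make the notion of non-uniqueness concrete, the following minimal sketch enumerates every 2-cluster partition of a tiny discrete distribution and reports which partitions attain the minimal WCSS. It is an illustration only, not the construction or the test used in the paper: the point sets `square` and `rectangle`, the helper names `wcss` and `optimal_partitions`, and the tolerance `tol` are all assumptions introduced here, and the WCSS is taken as the mean squared distance to the cluster centroid under the uniform distribution on the listed points.

```python
from itertools import product


def wcss(points, labels, k):
    """Mean squared distance of each point to its cluster centroid.

    Returns None when some cluster is empty, so the caller can skip it.
    """
    total = 0.0
    for c in range(k):
        cluster = [p for p, lab in zip(points, labels) if lab == c]
        if not cluster:
            return None
        cx = sum(p[0] for p in cluster) / len(cluster)
        cy = sum(p[1] for p in cluster) / len(cluster)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in cluster)
    return total / len(points)


def optimal_partitions(points, k=2, tol=1e-12):
    """Brute-force search: return the minimal WCSS and all partitions attaining it."""
    best, winners = float("inf"), set()
    for labels in product(range(k), repeat=len(points)):
        value = wcss(points, labels, k)
        if value is None:
            continue
        # Canonical form of the partition, so relabelled copies count once.
        partition = frozenset(
            frozenset(i for i, lab in enumerate(labels) if lab == c)
            for c in range(k)
        )
        if value < best - tol:
            best, winners = value, {partition}
        elif abs(value - best) <= tol:
            winners.add(partition)
    return best, winners


# Hypothetical toy distributions: uniform mass on the corners of a square
# and of a 2-by-1 rectangle.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
rectangle = [(0, 0), (2, 0), (0, 1), (2, 1)]

for name, pts in [("square", square), ("rectangle", rectangle)]:
    best, winners = optimal_partitions(pts, k=2)
    print(name, "minimal WCSS:", best, "optimal partitions:", len(winners))
```

On the square, the horizontal and vertical splits tie, so two distinct sets of 2-means exist; stretching one side breaks the symmetry and leaves a single optimum. Exhaustive enumeration is only feasible for a handful of points and is used here purely to make the tie visible.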