This paper proposes a novel, nonparametric measure based on interpoint distances to investigate whether a given data set contains any groups and, if so, how many. It is a cluster accuracy index applicable to data sets of arbitrary dimension, in conjunction with any clustering algorithm that requires the number of groups to be specified a priori. We perform univariate, nonparametric multiple tests of hypotheses, carrying out as many dependent tests as the sample size using the interpoint distances. The resulting $p$-values are combined to reach a decision, which is made step-wise over the possible numbers of clusters. This step-wise procedure avoids unnecessary computations compared with other accuracy measures in the literature. A data study demonstrates the proposed index's efficiency and superiority.
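The abstract does not state which rule combines the dependent $p$-values or how the step-wise decision is ordered, so the following is only a minimal sketch under stated assumptions. It assumes a Bonferroni combination (which remains valid under arbitrary dependence), a null hypothesis of the form "the data are adequately described by $k$ groups", and candidate values of $k$ tried in increasing order. The names `combine_pvalues_bonferroni`, `first_accepted_k`, `p_values_for_k`, `max_k`, and `alpha` are hypothetical and do not come from the paper; the per-observation tests themselves are left to the caller.

```python
def combine_pvalues_bonferroni(p_values):
    """Combine possibly dependent p-values into one global p-value.

    Assumption: Bonferroni is used here only as an illustration; the paper's
    actual combination rule is not given in the abstract.
    """
    p_values = list(p_values)
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    # Scale the smallest p-value by the number of tests and cap at 1.
    return min(1.0, len(p_values) * min(p_values))


def first_accepted_k(p_values_for_k, max_k, alpha=0.05):
    """Try k = 1, 2, ..., max_k and return the first k that is not rejected.

    p_values_for_k is a hypothetical callable returning one p-value per
    observation for a clustering with k groups. It is only called until a
    decision is reached, so larger values of k are never evaluated.
    """
    for k in range(1, max_k + 1):
        combined = combine_pvalues_bonferroni(p_values_for_k(k))
        if combined > alpha:
            return k
    return None  # every candidate up to max_k was rejected


if __name__ == "__main__":
    # Made-up p-values for illustration only, not results from the paper.
    toy = {1: [0.001, 0.2, 0.5], 2: [0.3, 0.4, 0.9], 3: [0.6, 0.7, 0.8]}
    print(first_accepted_k(lambda k: toy[k], max_k=3))
```

Because the loop stops at the first candidate that is not rejected, the $p$-values for larger numbers of clusters are never computed, which is one way a step-wise decision can save computation.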