Clusterability evaluation checks whether a data set contains any clustering structure. It is a crucial yet often overlooked step in cluster analysis and should precede the application of any clustering algorithm: if a data set is unclusterable, subsequent clustering analysis cannot yield valid results. Despite its importance, most existing studies focus on numerical data, leaving clusterability evaluation for categorical data an open problem. Here we present TestCat, a testing-based approach that assesses the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly associated attribute pairs; hence the sum of the chi-squared statistics over all attribute pairs is used as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets and show that TestCat outperforms solutions based on existing clusterability evaluation methods for numerical data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.
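To make the test statistic concrete, the following is a minimal sketch of summing Pearson chi-squared statistics over all attribute pairs. It is an illustration under stated assumptions, not the TestCat implementation: the function names `chi2_statistic` and `pairwise_chi2_sum` and the toy data are hypothetical, and the sketch stops at the statistic and its summed degrees of freedom because the abstract does not specify the reference distribution used to obtain the analytical $p$-value.

```python
from collections import Counter
from itertools import combinations


def chi2_statistic(x, y):
    """Pearson chi-squared statistic and degrees of freedom for two categorical columns."""
    n = len(x)
    row_counts = Counter(x)           # marginal counts of the first attribute
    col_counts = Counter(y)           # marginal counts of the second attribute
    joint_counts = Counter(zip(x, y)) # observed contingency-table cells
    stat = 0.0
    for a, row_total in row_counts.items():
        for b, col_total in col_counts.items():
            expected = row_total * col_total / n  # expected count under independence
            observed = joint_counts.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_counts) - 1) * (len(col_counts) - 1)
    return stat, dof


def pairwise_chi2_sum(rows):
    """Sum the chi-squared statistics (and degrees of freedom) over all attribute pairs.

    `rows` is a list of equal-length tuples, one tuple per object.
    """
    columns = list(zip(*rows))
    total_stat, total_dof = 0.0, 0
    for i, j in combinations(range(len(columns)), 2):
        stat, dof = chi2_statistic(columns[i], columns[j])
        total_stat += stat
        total_dof += dof
    return total_stat, total_dof


# Hypothetical toy data: two tight groups versus a fully balanced design.
clustered = [("a", "x", "p")] * 10 + [("b", "y", "q")] * 10
balanced = [(a, x, p) for a in "ab" for x in "xy" for p in "pq"] * 3

print(pairwise_chi2_sum(clustered))
print(pairwise_chi2_sum(balanced))
```

In the toy data, every attribute pair in `clustered` is perfectly associated, so the summed statistic is large, whereas in `balanced` every pair is exactly independent and the sum is zero. This is the contrast the test statistic is designed to capture.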