The objective of clusterability evaluation is to check whether a clustering structure exists within a data set. Clusterability evaluation is a crucial yet often-overlooked step in cluster analysis, and it should be carried out before any clustering algorithm is applied: if a data set is unclusterable, any subsequent clustering analysis will not yield valid results. Despite its importance, the majority of existing studies focus on numerical data, leaving clusterability evaluation for categorical data an open problem. Here we present TestCat, a testing-based approach that assesses the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly correlated attribute pairs; hence, the sum of the chi-squared statistics of all attribute pairs is employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets and show that TestCat outperforms solutions based on existing clusterability evaluation methods for numerical data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.
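The test statistic described above can be illustrated with a minimal sketch. It computes the Pearson chi-squared statistic for every attribute pair and sums them. The abstract does not specify how the analytical $p$-value is derived, so the sketch substitutes a simple permutation test (each attribute shuffled independently) as a stand-in; this is an assumption for illustration, not TestCat's procedure. The function names `chi2_statistic`, `pairwise_chi2_sum`, and `permutation_p_value`, as well as the toy data, are hypothetical.

```python
from collections import Counter
from itertools import combinations
import random


def chi2_statistic(x, y):
    """Pearson chi-squared statistic for two categorical columns of equal length."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed counts per (value_x, value_y) cell
    row = Counter(x)             # marginal counts of x
    col = Counter(y)             # marginal counts of y
    stat = 0.0
    for a, count_a in row.items():
        for b, count_b in col.items():
            expected = count_a * count_b / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def pairwise_chi2_sum(columns):
    """Sum of chi-squared statistics over all attribute pairs."""
    return sum(chi2_statistic(x, y) for x, y in combinations(columns, 2))


def permutation_p_value(columns, n_perm=200, seed=0):
    """Permutation p-value for the summed statistic.

    Assumption: this replaces the analytical p-value mentioned in the text.
    Shuffling each attribute independently breaks any association between
    attributes while keeping each attribute's marginal distribution.
    """
    rng = random.Random(seed)
    observed = pairwise_chi2_sum(columns)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        if pairwise_chi2_sum(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: three attributes that all follow the same two-group split.
clustered = [
    ["x"] * 10 + ["y"] * 10,
    ["p"] * 10 + ["q"] * 10,
    ["u"] * 10 + ["v"] * 10,
]
print(pairwise_chi2_sum(clustered))    # 60.0: three perfectly associated pairs, 20 each
print(permutation_p_value(clustered))  # small value, suggesting clusterable data
```

A permutation test is convenient for a sketch because it needs no distributional derivation, but it is far slower than a closed-form $p$-value on data sets with many attributes, which is one practical motivation for an analytical approach.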