Cluster analysis requires many decisions: the clustering method and its implied reference model, the number of clusters and, often, several hyper-parameters and algorithm tuning choices. In practice, one produces several partitions and chooses a final one based on validation or selection criteria. There is an abundance of validation methods that, implicitly or explicitly, assume a certain clustering notion. Moreover, they are often restricted to partitions obtained from a specific method. In this paper, we focus on groups that can be well separated by quadratic or linear boundaries. The reference cluster concept is defined through the quadratic discriminant score function and parameters describing each cluster's size, center and scatter. We develop two cluster-quality criteria called quadratic scores. We show that these criteria are consistent with groups generated from a general class of elliptically-symmetric distributions. Groups of this type are commonly sought in applications. We investigate the connection with likelihood theory for mixture models and model-based clustering. Based on bootstrap resampling of the quadratic scores, we propose a selection rule for choosing among many clustering solutions. The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods. Extensive numerical experiments and the analysis of real data show that, even though some competing methods are superior in some setups, the proposed methodology achieves better overall performance.
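To make the reference cluster concept more concrete, the following is a minimal sketch of how a quadratic discriminant score could be evaluated for a given partition. The abstract does not state the formulas, so this assumes the textbook quadratic discriminant score, log(size) - 0.5 * log det(scatter) - 0.5 * squared Mahalanobis distance to the center, with size, center and scatter estimated by the cluster proportion, sample mean and sample covariance. The function names (`cluster_params`, `quadratic_scores`, `hard_score`) and the averaging of each point's best score are illustrative assumptions, not the paper's definitions of its two criteria.

```python
import numpy as np


def cluster_params(X, labels):
    """Estimate (size, center, scatter) for each cluster of a partition.

    Assumption: size = cluster proportion, center = sample mean,
    scatter = sample covariance. Each cluster needs enough points
    for its covariance matrix to be non-singular.
    """
    n = X.shape[0]
    params = []
    for k in np.unique(labels):
        Xk = X[labels == k]
        size = Xk.shape[0] / n
        center = Xk.mean(axis=0)
        scatter = np.atleast_2d(np.cov(Xk, rowvar=False))
        params.append((size, center, scatter))
    return params


def quadratic_scores(X, params):
    """Return an (n, K) array of textbook quadratic discriminant scores."""
    scores = np.empty((X.shape[0], len(params)))
    for j, (size, center, scatter) in enumerate(params):
        diff = X - center
        _, logdet = np.linalg.slogdet(scatter)
        # Squared Mahalanobis distance of every point to this center.
        maha = np.einsum("ij,ij->i", diff, np.linalg.solve(scatter, diff.T).T)
        scores[:, j] = np.log(size) - 0.5 * logdet - 0.5 * maha
    return scores


def hard_score(X, labels):
    """Illustrative criterion: average over points of the best cluster score."""
    params = cluster_params(X, labels)
    return quadratic_scores(X, params).max(axis=1).mean()
```

Because the score depends only on the data and the partition labels, a criterion of this kind can in principle be computed for partitions produced by any clustering method, which is the property the abstract emphasizes. Larger values would indicate a partition whose groups are better described by elliptical clusters.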