For multivariate data with noise variables, tandem clustering is a well-known technique that aims to improve cluster identification by first reducing the dimension. However, the usual approach using principal component analysis (PCA) has been criticized for focusing only on inertia, so that the first components do not necessarily retain the structure of interest for clustering. To overcome this drawback, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while returning affine-invariant components. Existing theoretical results guarantee that, under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. Nevertheless, ICS has received little attention in a clustering context. Two challenges are the choice of the pair of scatter matrices and the selection of the components to retain. For clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix that captures the within-cluster structure and another that captures the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully selected subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure present in the data. In an extensive simulation study and in empirical applications with benchmark data sets, different combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the approach with PCA.
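To make the joint diagonalization step concrete, the following is a minimal sketch, not the authors' implementation. It assumes the classical COV-COV4 scatter pair (the sample covariance and the scatter matrix of fourth moments) as a simple stand-in, rather than the local shape, pairwise, or small-subset MCD scatters discussed above. The function names `cov4` and `ics` and the simulated data are hypothetical and chosen for illustration only.

```python
import numpy as np
from scipy.linalg import eigh


def cov4(X):
    """Scatter matrix of fourth moments (COV4); an assumed, classical choice."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # Squared Mahalanobis distance of each observation to the sample mean.
    d2 = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(cov), Xc)
    # Covariance-like matrix with observations weighted by d2.
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))


def ics(X, scatter1, scatter2):
    """Jointly diagonalize two scatter matrices and return the components."""
    V1, V2 = scatter1(X), scatter2(X)
    # Generalized symmetric eigenproblem V2 b = rho * V1 b. scipy normalizes
    # the eigenvectors so that B.T @ V1 @ B is the identity matrix.
    rho, B = eigh(V2, V1)
    order = np.argsort(rho)[::-1]  # sort generalized eigenvalues, largest first
    rho, B = rho[order], B[:, order]
    Z = (X - X.mean(axis=0)) @ B   # invariant coordinates
    return rho, B, Z


# Hypothetical data: two roughly balanced Gaussian clusters that differ in a
# single direction, plus four noise variables, hidden by a random linear mixing.
rng = np.random.default_rng(0)
n, p = 1000, 5
labels = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[:, 0] += 6.0 * labels
A = rng.standard_normal((p, p))
X = X @ A.T

rho, B, Z = ics(X, lambda data: np.cov(data, rowvar=False), cov4)
```

In this simulated setting with two balanced groups, the separating direction has low kurtosis, so under COV-COV4 the cluster structure is expected on the last component `Z[:, -1]`. A tandem approach would then run a clustering method on the selected components only; which components to keep is exactly the selection problem raised in the abstract.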