For multivariate data, tandem clustering is a well-known technique aiming to improve cluster identification through initial dimension reduction. Nevertheless, the usual approach using principal component analysis (PCA) has been criticized for focusing solely on inertia so that the first components do not necessarily retain the structure of interest for clustering. To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Certain theoretical results have been previously derived and guarantee that under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. However, ICS has garnered minimal attention within the context of clustering. Two challenges associated with ICS include choosing the pair of scatter matrices and selecting the components to retain. For effective clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure in the data. In an extensive simulation study and empirical applications with benchmark data sets, various combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach.
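To make the joint diagonalization step concrete, the following is a minimal sketch of ICS followed by a look at which component carries the group structure. It is not the authors' implementation. It assumes the classical scatter pair COV and COV4 (the fourth-moment scatter) purely for illustration, whereas the abstract recommends local shape or pairwise scatters, or an MCD with a small subset size, for the within-cluster scatter. The function names `cov4` and `ics` and the simulated two-group data are hypothetical.

```python
import numpy as np


def cov4(X):
    """Fourth-moment scatter (COV4), assumed here as the second scatter matrix."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    # Squared Mahalanobis distance of each observation with respect to COV.
    d2 = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(S), Xc)
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))


def ics(X, scatter1=lambda X: np.cov(X, rowvar=False), scatter2=cov4):
    """Jointly diagonalize two scatter matrices.

    Returns the generalized eigenvalues rho (in decreasing order), the
    matrix B with B' S1 B = I and B' S2 B = diag(rho), and the invariant
    coordinates Z.
    """
    S1, S2 = scatter1(X), scatter2(X)
    # Whiten with respect to S1 using its symmetric inverse square root.
    w, V = np.linalg.eigh(S1)
    S1_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    M = S1_inv_sqrt @ S2 @ S1_inv_sqrt
    M = (M + M.T) / 2  # enforce symmetry against rounding error
    rho, U = np.linalg.eigh(M)
    order = np.argsort(rho)[::-1]
    rho, U = rho[order], U[:, order]
    B = S1_inv_sqrt @ U
    Z = (X - X.mean(axis=0)) @ B
    return rho, B, Z, S1, S2


# Hypothetical data: two balanced Gaussian groups separated along the first
# variable, which has a smaller total variance than the noise variables.
rng = np.random.default_rng(0)
n, p = 2000, 4
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p)) * np.array([1.0, 5.0, 5.0, 5.0])
X[:, 0] += np.where(y == 1, 3.0, -3.0)

rho, B, Z, S1, S2 = ics(X)

# PCA for comparison: the first principal component follows the largest variance.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = Xc @ evecs[:, -1]

ics_corr = abs(np.corrcoef(Z[:, -1], y)[0, 1])  # last invariant coordinate
pca_corr = abs(np.corrcoef(pc1, y)[0, 1])       # first principal component
print("generalized eigenvalues:", np.round(rho, 3))
print("|corr(last IC, group)|  =", round(ics_corr, 3))
print("|corr(first PC, group)| =", round(pca_corr, 3))
```

In this setup the balanced mixture has low kurtosis along the separating direction, so with COV and COV4 the group structure is expected on the last invariant coordinate, while the first principal component tracks a high-variance noise variable. In a tandem workflow, the selected invariant coordinates would then be passed to a clustering algorithm of choice.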