For multivariate data, tandem clustering is a well-known technique that aims to improve cluster identification through an initial dimension reduction step. However, the usual approach based on principal component analysis (PCA) has been criticized for focusing solely on inertia, so that the first components do not necessarily retain the structure of interest for clustering. To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Existing theoretical results guarantee that, under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. Nevertheless, ICS has received little attention in the context of clustering. Two challenges associated with ICS are choosing the pair of scatter matrices and selecting the components to retain. For clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of how well it preserves the cluster structure in the data. In an extensive simulation study and empirical applications with benchmark data sets, various combinations of scatter matrices and component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach.
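The joint diagonalization of two scatter matrices described above can be illustrated with a minimal sketch. It assumes the classical pair of the sample covariance and the fourth-moment scatter (COV-COV4), which is a common default in the ICS literature and not necessarily one of the pairs recommended in this work; the function names `cov4` and `ics` and the simulated two-group data are hypothetical and serve only as an illustration.

```python
import numpy as np


def cov4(X):
    """Fourth-moment scatter matrix (assumed second scatter for this sketch)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Squared Mahalanobis distance of each observation to the mean.
    d2 = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(S), Xc)
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))


def ics(X):
    """Jointly diagonalize COV and COV4; return scores, eigenvalues, directions."""
    S1 = np.cov(X, rowvar=False)
    S2 = cov4(X)
    # Whiten with respect to S1 using its symmetric inverse square root.
    w, V = np.linalg.eigh(S1)
    S1_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    M = S1_inv_sqrt @ S2 @ S1_inv_sqrt
    M = (M + M.T) / 2  # enforce symmetry against rounding error
    evals, U = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1]  # sort eigenvalues in decreasing order
    evals, U = evals[order], U[:, order]
    B = U.T @ S1_inv_sqrt  # rows are the invariant coordinate directions
    Z = (X - X.mean(axis=0)) @ B.T  # invariant coordinates (component scores)
    return Z, evals, B


# Hypothetical data: two balanced Gaussian groups separated along one direction.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 1000)
X = rng.standard_normal((2000, 3))
X[:, 0] += np.where(labels == 0, -4.0, 4.0)

Z, evals, B = ics(X)
print("generalized eigenvalues:", np.round(evals, 3))
```

In a tandem clustering workflow, a subset of the columns of `Z` would then be passed to a clustering algorithm. For a balanced two-group mixture such as this one, the separating direction is expected to appear on the last component, in line with the statement that group structure shows up on the first and/or last components; which components to keep in general is exactly the selection problem discussed above.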