Invariant coordinate selection is an unsupervised multivariate data transformation useful in many contexts such as outlier detection or clustering. It is based on the simultaneous diagonalization of two affine equivariant and positive definite scatter matrices. Its classical implementation relies on a non-symmetric eigenvalue problem, obtained by diagonalizing one scatter matrix relative to the other. In case of collinearity, at least one of the scatter matrices is singular, making the problem unsolvable. To address this limitation, three approaches are proposed, based respectively on a Moore-Penrose pseudoinverse, a dimension reduction, and a generalized singular value decomposition. Their properties are investigated both theoretically and through various empirical applications. Overall, the extension based on the generalized singular value decomposition seems the most promising, even though it restricts the choice of scatter matrices to those that can be expressed as cross-products. In practice, some of the approaches also appear suitable for high-dimension low-sample-size data.
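As an illustration of the classical setting described above, the following is a minimal sketch of a simultaneous diagonalization of two scatter matrices, and of how it breaks down under collinearity. It is not the proposed method. The scatter pair (the sample covariance and a fourth-moment scatter), the function names `cov4` and `ics`, and the singularity tolerance are assumptions made for this example. The non-symmetric problem is solved here by whitening with the first scatter and then solving a symmetric eigenvalue problem, which is one standard equivalent formulation.

```python
import numpy as np


def cov4(X):
    """Fourth-moment scatter: covariance weighted by squared Mahalanobis distances.

    Assumed choice of second scatter for this sketch; requires a non-singular covariance.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of each observation to the mean.
    d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))


def ics(X, tol=1e-10):
    """Simultaneously diagonalize V1 = COV and V2 = COV4 (hypothetical helper).

    Returns (rho, B, Z) with B.T @ V1 @ B = I and B.T @ V2 @ B = diag(rho).
    """
    V1 = np.cov(X, rowvar=False)
    w, U = np.linalg.eigh(V1)
    # Under collinearity V1 is singular and cannot be inverted: the classical
    # formulation has no solution, so we stop here.
    if w.min() <= tol * w.max():
        raise np.linalg.LinAlgError("V1 is numerically singular (collinear data)")
    V2 = cov4(X)
    # Whiten with V1^{-1/2}, then solve a symmetric eigenvalue problem.
    V1_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
    M = V1_inv_sqrt @ V2 @ V1_inv_sqrt
    M = (M + M.T) / 2  # remove rounding asymmetry
    rho, Q = np.linalg.eigh(M)
    order = np.argsort(rho)[::-1]  # decreasing generalized eigenvalues
    rho, Q = rho[order], Q[:, order]
    B = V1_inv_sqrt @ Q  # columns b solve V2 b = rho V1 b
    Z = (X - X.mean(axis=0)) @ B  # invariant coordinates
    return rho, B, Z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    rho, B, Z = ics(X)
    print(np.round(B.T @ np.cov(X, rowvar=False) @ B, 6))
    # Adding a column that is a linear combination of the others makes V1 singular.
    X_collinear = np.column_stack([X, X[:, 0] + X[:, 1]])
    try:
        ics(X_collinear)
    except np.linalg.LinAlgError as err:
        print("failed:", err)
```

On full-rank data the returned matrix `B` diagonalizes both scatter matrices at once. On the collinear data the first scatter is singular, which is the situation the three proposed approaches are meant to handle.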