Principal component analysis (PCA) is a key tool for dimensionality reduction. Various methods, such as K-Subspaces (KSS), extend PCA to the union of subspaces (UoS) setting to cluster data drawn from multiple subspaces. However, some applications involve heterogeneous data whose quality varies because each sample has its own noise characteristics. Heteroscedastic methods aim to handle such mixed data quality. This paper develops a heteroscedastic subspace clustering method, named ALPCAHUS, that estimates the sample-wise noise variances and uses this information to improve the estimates of the subspace bases associated with the low-rank structure of the data. The algorithm builds on KSS principles by extending the recently proposed heteroscedastic PCA method LR-ALPCAH to clusters with heteroscedastic noise in the UoS setting. Simulations and real-data experiments show that accounting for data heteroscedasticity improves clustering compared with existing clustering algorithms. Code is available at https://github.com/javiersc1/ALPCAHUS.
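For orientation, the KSS baseline that the method builds on can be sketched as follows. This is a minimal illustration of plain K-Subspaces, not the ALPCAHUS algorithm: it alternates between assigning each sample to the subspace that captures most of its energy and refitting each basis by ordinary PCA (truncated SVD). The function name `k_subspaces`, its parameters, and the synthetic two-level noise model are assumptions made for this example only. As the abstract describes it, ALPCAHUS would replace the PCA refit with a heteroscedastic estimate based on LR-ALPCAH, which is not reproduced here.

```python
import numpy as np


def k_subspaces(Y, n_clusters, dim, n_iters=50, seed=0):
    """Plain K-Subspaces (illustrative sketch, not ALPCAHUS).

    Y has shape (ambient_dim, n_samples); each column is one sample.
    Returns cluster labels and a list of orthonormal subspace bases.
    """
    rng = np.random.default_rng(seed)
    D, N = Y.shape
    # Random orthonormal bases as initialization (reduced QR gives a D x dim Q).
    bases = [np.linalg.qr(rng.standard_normal((D, dim)))[0]
             for _ in range(n_clusters)]
    labels = np.full(N, -1)
    for _ in range(n_iters):
        # Assignment step: the largest projection energy ||U_k^T y||^2 is
        # equivalent to the smallest residual ||y - U_k U_k^T y||^2.
        energy = np.stack([np.sum((U.T @ Y) ** 2, axis=0) for U in bases])
        new_labels = np.argmax(energy, axis=0)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: ordinary PCA per cluster via truncated SVD.
        # (A heteroscedastic method would reweight samples by noise level here.)
        for k in range(n_clusters):
            Yk = Y[:, labels == k]
            if Yk.shape[1] >= dim:  # keep the old basis if the cluster is too small
                U, _, _ = np.linalg.svd(Yk, full_matrices=False)
                bases[k] = U[:, :dim]
    return labels, bases


# Synthetic union-of-subspaces data with two per-sample noise levels (assumed setup).
rng = np.random.default_rng(1)
D, dim, n_per = 20, 2, 100
blocks = []
for _ in range(2):
    U_true = np.linalg.qr(rng.standard_normal((D, dim)))[0]
    X = U_true @ rng.standard_normal((dim, n_per))
    noise_std = np.where(np.arange(n_per) < 80, 0.05, 0.5)  # 80 clean, 20 noisy samples
    blocks.append(X + rng.standard_normal((D, n_per)) * noise_std)
Y = np.hstack(blocks)

labels, bases = k_subspaces(Y, n_clusters=2, dim=dim)
```

Because the update step treats every sample equally, the noisier samples can pull the estimated bases away from the true subspaces; this is the failure mode that sample-wise noise variance estimation is meant to address. KSS is also sensitive to initialization, so results from this sketch may vary with the seed.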