Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to address such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions about the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require the subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring the subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code is available at https://github.com/javiersc1/ALPCAH.
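The heteroscedastic setting described above can be illustrated with a minimal sketch: data drawn from a planted subspace, where half the samples are much noisier than the rest. The inverse-variance weighting below is a simplified heuristic (using the true noise variances, which ALPCAH itself would estimate); all dimensions, seeds, and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, k = 50, 200, 3  # ambient dimension, number of samples, subspace dimension

# Planted orthonormal basis U (D x k) and low-rank coefficients
U, _ = np.linalg.qr(rng.standard_normal((D, k)))
coeffs = rng.standard_normal((k, N))

# Heteroscedastic noise: the second half of the samples is much noisier
sigma = np.where(np.arange(N) < N // 2, 0.1, 3.0)
Y = U @ coeffs + rng.standard_normal((D, N)) * sigma

# Homoscedastic baseline: plain PCA via SVD of the raw data matrix
U_pca = np.linalg.svd(Y, full_matrices=False)[0][:, :k]

# Heteroscedastic heuristic: scale each sample (column) by 1/sigma_i,
# so every column has unit noise variance before taking the SVD
U_w = np.linalg.svd(Y / sigma, full_matrices=False)[0][:, :k]

def subspace_err(U_hat):
    # Frobenius norm of the part of U missed by the estimated subspace
    return np.linalg.norm(U - U_hat @ (U_hat.T @ U))

print(subspace_err(U_pca), subspace_err(U_w))
```

With the noisy samples dominating, the plain SVD basis is corrupted, while the inverse-variance-weighted version recovers the subspace well; this gap is the motivation for estimating the sample-wise variances jointly, as ALPCAH does.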