ALPCAH: Sample-wise Heteroscedastic PCA with Tail Singular Value Regularization

From arXiv. This article has been accepted for publication in the Fourteenth International Conference on Sampling Theory and Applications and is accessible via IEEE Xplore. See the DOI section.

Principal component analysis (PCA) is a key tool for dimensionality reduction and is useful in many data science problems. However, many applications involve heterogeneous data whose quality varies because different data sources have different noise characteristics. Methods that handle such mixed datasets are known as heteroscedastic methods. Current methods such as HePPCAT make Gaussian assumptions about the basis coefficients that may not hold in practice. Other methods, such as Weighted PCA (WPCA), assume the noise variances are known, which is often not the case in practice. This paper develops a PCA method that estimates the sample-wise noise variances and uses this information in the model to improve the estimate of the subspace basis associated with the low-rank structure of the data. It does so without distributional assumptions on the low-rank component and without assuming the noise variances are known. Simulations show the effectiveness of accounting for such heteroscedasticity in the data and the benefits of applying the method to all of the data rather than retaining only the good data, and they compare the method against other PCA methods established in the literature: PCA, Robust PCA (RPCA), and HePPCAT. Code is available at https://github.com/javiersc1/ALPCAH
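To make the setting in the abstract concrete, the following is a minimal simulation sketch of sample-wise heteroscedastic data: a few low-noise samples mixed with many high-noise samples that share one low-rank subspace. It is not the ALPCAH algorithm. All function names, dimensions, and noise levels are illustrative assumptions, and the weighted variant uses inverse-variance column scaling with known variances, which is one common weighting choice and is exactly the assumption the paper seeks to avoid.

```python
import numpy as np

def simulate_heteroscedastic(d=50, k=3, n_good=20, n_bad=200,
                             sigma_good=0.1, sigma_bad=2.0, seed=0):
    """Draw samples y_i = U z_i + sigma_i * eps_i with two noise levels (assumed setup)."""
    rng = np.random.default_rng(seed)
    # Random orthonormal basis for the k-dimensional subspace.
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    n = n_good + n_bad
    Z = rng.standard_normal((k, n))  # basis coefficients
    sigmas = np.concatenate([np.full(n_good, sigma_good),
                             np.full(n_bad, sigma_bad)])
    # Broadcasting scales column i of the noise matrix by sigmas[i].
    Y = U @ Z + rng.standard_normal((d, n)) * sigmas
    return Y, U, sigmas

def top_k_left_singular_vectors(Y, k):
    """Subspace estimate from the k leading left singular vectors of Y."""
    U_hat, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U_hat[:, :k]

def subspace_error(U, U_hat):
    """Normalized Frobenius distance between the two projection matrices."""
    P = U @ U.T
    return np.linalg.norm(P - U_hat @ U_hat.T) / np.linalg.norm(P)

Y, U, sigmas = simulate_heteroscedastic()
k = U.shape[1]

U_pca = top_k_left_singular_vectors(Y, k)                  # all data, unweighted
U_weighted = top_k_left_singular_vectors(Y / sigmas, k)    # known-variance weighting
U_good = top_k_left_singular_vectors(Y[:, sigmas == sigmas.min()], k)  # good data only

for name, U_hat in [("PCA, all data", U_pca),
                    ("weighted, known variances", U_weighted),
                    ("PCA, good data only", U_good)]:
    print(f"{name}: subspace error = {subspace_error(U, U_hat):.3f}")
```

Comparing the three printed errors across seeds and noise levels gives a rough feel for the trade-off the abstract describes between using all of the data and keeping only the good samples.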

Related content

PCA

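In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional space by maximizing the variance along each dimension. Given a set of points in two, three, or more dimensions, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line can be chosen in the same way from the directions perpendicular to the first line. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called the principal components.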
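The construction described above can be sketched with a singular value decomposition of the centered data. This is a minimal illustration, assuming a rows-as-samples layout; the function name `pca` and the toy data are hypothetical.

```python
import numpy as np

def pca(X, k):
    """PCA via SVD. X has shape (n_samples, n_features).

    Returns the k principal components as rows, and the projected scores.
    """
    Xc = X - X.mean(axis=0)              # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                  # orthonormal best-fit directions
    scores = Xc @ components.T           # coordinates in the new basis
    return components, scores

# Toy data: correlated 2-D points.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

components, scores = pca(X, 2)
cov = np.cov(scores, rowvar=False)       # covariance of the projected data
print(np.round(cov, 6))
```

The off-diagonal entries of the printed covariance matrix are zero up to floating-point error, which reflects the statement that the data are uncorrelated along the principal components, and the diagonal entries are ordered from largest to smallest variance.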
