Unlabeled Principal Component Analysis and Matrix Completion

We introduce robust principal component analysis from a data matrix in which the entries of its columns have been corrupted by permutations, termed Unlabeled Principal Component Analysis (UPCA). Using algebraic geometry, we establish that UPCA is a well-defined algebraic problem in the sense that the only matrices of minimal rank that agree with the given data are row-permutations of the ground-truth matrix, arising as the unique solutions of a polynomial system of equations. Further, we propose an efficient two-stage algorithmic pipeline for UPCA suitable for the practically relevant case where only a fraction of the data have been permuted. Stage-I employs outlier-robust PCA methods to estimate the ground-truth column-space. Equipped with the column-space, Stage-II applies recent methods for unlabeled sensing to restore the permuted data. Allowing for missing entries on top of permutations in UPCA leads to the problem of unlabeled matrix completion, for which we derive theory and algorithms of similar flavor. Experiments on synthetic data, face images, educational and medical records reveal the potential of our algorithms for applications such as data privatization and record linkage.
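The two-stage pipeline can be illustrated on a small noiseless example. The sketch below is not the authors' implementation: Stage-I uses a simple RANSAC-style subspace search as a stand-in for an outlier-robust PCA method, and Stage-II uses a brute-force search over all permutations as a stand-in for an unlabeled sensing solver, which is only feasible for very small row counts. The function names `ransac_subspace` and `restore_column`, the problem sizes, and the tolerance are all assumptions made for illustration.

```python
import itertools
import numpy as np


def ransac_subspace(X, r, trials=200, tol=1e-8, seed=0):
    """Stage-I stand-in: estimate an r-dimensional column space of X
    while ignoring outlier (permuted) columns. Assumes noiseless inliers."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    norms = np.linalg.norm(X, axis=0)
    best = np.zeros(n, dtype=bool)
    for _ in range(trials):
        idx = rng.choice(n, size=r, replace=False)
        Q, _ = np.linalg.qr(X[:, idx])  # orthonormal basis of the sampled span
        resid = np.linalg.norm(X - Q @ (Q.T @ X), axis=0)
        inliers = resid <= tol * norms
        if inliers.sum() > best.sum():
            best = inliers
    # Refit the basis on all columns consistent with the best sample.
    U, _, _ = np.linalg.svd(X[:, best], full_matrices=False)
    return U[:, :r], best


def restore_column(x, U):
    """Stage-II stand-in: try every reordering of x and keep the one
    closest to the column space spanned by the orthonormal basis U."""
    perms = np.array(list(itertools.permutations(range(x.size))))
    cands = x[perms]  # each row is one reordering of x
    resid = cands - (cands @ U) @ U.T
    return cands[np.argmin(np.linalg.norm(resid, axis=1))]


# Synthetic rank-r data in which a fraction of the columns is permuted.
rng = np.random.default_rng(1)
m, n, r, n_perm = 6, 40, 2, 10
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
X_obs = X_true.copy()
for j in rng.choice(n, size=n_perm, replace=False):
    X_obs[:, j] = X_true[rng.permutation(m), j]

U_hat, inliers = ransac_subspace(X_obs, r)
X_hat = X_obs.copy()
for j in np.flatnonzero(~inliers):
    X_hat[:, j] = restore_column(X_obs[:, j], U_hat)

print("max restoration error:", np.max(np.abs(X_hat - X_true)))
```

The brute-force search costs m! candidate orderings per column, so it only serves to make the problem statement concrete. With noise or larger m, both stand-ins would need to be replaced by the kind of robust PCA and unlabeled sensing methods the abstract refers to.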

