Fair Principal Component Analysis (PCA) is a problem setting in which we aim to perform PCA while keeping the resulting representation fair, in the sense that the projected distributions, conditional on the sensitive attributes, match one another. However, existing approaches to fair PCA have two main problems: theoretically, fair PCA has had no statistical foundation in terms of learnability; practically, limited memory prevents the use of existing approaches, as they explicitly rely on full access to the entire data. On the theoretical side, we rigorously formulate fair PCA using a new notion called \emph{probably approximately fair and optimal} (PAFO) learnability. On the practical side, motivated by recent advances in streaming algorithms for addressing memory limitations, we propose a new setting called \emph{fair streaming PCA} along with a memory-efficient algorithm, the fair noisy power method (FNPM). We then provide a \emph{statistical} guarantee for FNPM in terms of PAFO-learnability, which is the first of its kind in the fair PCA literature. Lastly, we verify the efficacy and memory efficiency of our algorithm on real-world datasets.
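To make the memory argument concrete, the following is a minimal sketch of a generic block power method for streaming PCA, the kind of iteration that noisy power method analyses study. It is not the FNPM algorithm: the abstract does not specify how the fairness constraint is enforced, so that step is omitted entirely. The names `streaming_power_method` and `gaussian_stream`, and the parameters `n_iters` and `block_size`, are assumptions introduced for illustration only.

```python
import numpy as np


def gaussian_stream(scales, seed=1):
    """Hypothetical data source: yields one zero-mean sample at a time.

    Coordinate i has standard deviation scales[i], so the covariance is
    diag(scales**2) and the top principal directions are known in advance.
    """
    rng = np.random.default_rng(seed)
    while True:
        yield rng.standard_normal(scales.shape[0]) * scales


def streaming_power_method(stream, d, k, n_iters, block_size, seed=0):
    """Estimate a top-k principal subspace from a stream of d-dim samples.

    Only the d x k iterate V and a d x k accumulator S are stored, so the
    working memory is O(d * k) rather than O(d * d) or O(n * d).
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal starting basis.
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(n_iters):
        S = np.zeros((d, k))
        for _ in range(block_size):
            x = next(stream)
            # Adds x x^T V without ever forming the d x d matrix x x^T.
            S += np.outer(x, x @ V)
        # S / block_size is a noisy estimate of (covariance @ V);
        # QR re-orthonormalizes the iterate.
        V, _ = np.linalg.qr(S / block_size)
    return V
```

Each sample is consumed once and then discarded, which is the sense in which such a method avoids full access to the entire data. A fair variant would additionally have to constrain the subspace using the sensitive attribute; how that is done is left to the paper itself.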