We consider the problem of fair column subset selection. In particular, we assume that two groups are present in the data, and the chosen column subset must provide a good approximation for both, relative to their respective best rank-k approximations. We show that this fair setting introduces significant challenges: in order to extend known results, one cannot do better than the trivial solution of simply picking twice as many columns as the original methods. We adopt a known approach based on deterministic leverage-score sampling, and show that merely selecting a subset of appropriate size becomes NP-hard in the presence of two groups. Whereas finding a subset of twice the desired size is trivial, we provide an efficient algorithm that achieves the same guarantees with essentially 1.5 times the desired size. We validate our methods through an extensive set of experiments on real-world data.
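To make the "trivial solution" concrete, the following is a minimal sketch of deterministic leverage-score selection applied to each group separately, followed by taking the union of the two column sets. It is an illustration under stated assumptions, not the algorithm proposed in this work: we assume the two groups are row blocks `A` and `B` over the same columns, that the rank-k leverage score of a column is the squared norm of its entries in the top-k right singular vectors, and that columns are kept in decreasing score order until the scores sum to at least a threshold `k - eps`. The function names, the parameter `eps`, and the synthetic data are hypothetical.

```python
import numpy as np

def rank_k_leverage_scores(M, k):
    """Rank-k leverage score of each column of M (assumed definition:
    squared column norms of the top-k right singular vectors)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    Vk = Vt[:k, :]                    # shape (k, n_columns)
    return np.sum(Vk ** 2, axis=0)    # one non-negative score per column

def top_columns_reaching(scores, threshold):
    """Smallest prefix of columns, in decreasing score order, whose
    scores sum to at least `threshold` (all columns if unreachable)."""
    order = np.argsort(scores)[::-1]
    cumulative = np.cumsum(scores[order])
    # First position where the running sum reaches the threshold;
    # the small slack guards against floating-point round-off.
    count = int(np.searchsorted(cumulative, threshold - 1e-12)) + 1
    count = min(count, len(order))
    return set(order[:count].tolist())

def union_baseline(A, B, k, eps):
    """Select columns for each group on its own, then take the union.
    The union meets the threshold for both groups, at the cost of up
    to the sum of the two individual subset sizes."""
    threshold = k - eps
    cols_a = top_columns_reaching(rank_k_leverage_scores(A, k), threshold)
    cols_b = top_columns_reaching(rank_k_leverage_scores(B, k), threshold)
    return cols_a, cols_b, cols_a | cols_b

# Synthetic example (not data from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 12))     # group A: 30 rows, 12 shared columns
B = rng.standard_normal((25, 12))     # group B: 25 rows, same 12 columns
cols_a, cols_b, cols_union = union_baseline(A, B, k=3, eps=0.5)
print(len(cols_a), len(cols_b), len(cols_union))
```

The union is the baseline the abstract calls trivial; the contribution described above concerns obtaining the same guarantee for both groups with fewer columns than this union requires in the worst case.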