We consider the problem of fair column subset selection. In particular, we assume that two groups are present in the data, and the chosen column subset must provide a good approximation for both, relative to their respective best rank-k approximations. We show that this fair setting introduces significant challenges: in order to extend known results, one cannot do better than the trivial solution of simply picking twice as many columns as the original methods. We adopt a known approach based on deterministic leverage-score sampling, and show that merely sampling a subset of appropriate size becomes NP-hard in the presence of two groups. Whereas finding a subset of two times the desired size is trivial, we provide an efficient algorithm that achieves the same guarantees with essentially 1.5 times the desired size. We validate our methods through an extensive set of experiments on real-world data.
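To make the setting concrete, the following is a minimal sketch of the fair objective and of the trivial "twice as many columns" baseline mentioned above. It rests on several assumptions that the abstract does not spell out: the two groups are taken to be row blocks `A` and `B` over the same columns, the quality of a subset is measured as the worst ratio between a group's reconstruction error and its best rank-k error, and the deterministic leverage-score rule keeps the highest-scoring columns until their scores sum to a threshold `theta`. All function names and the value of `theta` are illustrative, and the sketch does not implement the 1.5-times algorithm.

```python
import numpy as np


def rank_k_error(M, k):
    """Frobenius error of the best rank-k approximation of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2))


def reconstruction_error(M, cols):
    """Frobenius error after projecting M onto the span of its columns `cols`."""
    C = M[:, cols]
    return np.linalg.norm(M - C @ np.linalg.pinv(C) @ M, "fro")


def fair_cost(A, B, cols, k):
    """Worst normalized error over the two groups (assumed objective)."""
    return max(
        reconstruction_error(A, cols) / rank_k_error(A, k),
        reconstruction_error(B, cols) / rank_k_error(B, k),
    )


def top_leverage_columns(M, k, theta):
    """Shortest prefix of columns, ordered by decreasing rank-k leverage
    score, whose scores sum to at least theta (theta should be below k)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    scores = np.sum(Vt[:k] ** 2, axis=0)  # rank-k leverage score per column
    order = np.argsort(-scores)
    cumulative = np.cumsum(scores[order])
    count = int(np.searchsorted(cumulative, theta)) + 1
    return order[: min(count, len(order))]


def trivial_fair_subset(A, B, k, theta):
    """Baseline: select columns for each group separately and take the union,
    which can be up to twice as large as a single-group selection."""
    cols_a = top_leverage_columns(A, k, theta)
    cols_b = top_leverage_columns(B, k, theta)
    return np.union1d(cols_a, cols_b)


# Synthetic demo data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
B = rng.standard_normal((25, 10))
k, theta = 3, 2.5

subset = trivial_fair_subset(A, B, k, theta)
print("columns chosen:", subset)
print("fair cost:", fair_cost(A, B, subset, k))
```

The union guarantees that each group's leverage-score threshold is met, at the price of a larger subset; the hardness result in the abstract concerns finding a smaller subset that meets both thresholds at once.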