Column subset selection asks for a subset of columns of an input matrix such that the matrix can be reconstructed as accurately as possible within the span of the selected columns. A natural extension is a setting where the matrix rows are partitioned into two groups and the goal is to choose a subset of columns that minimizes the larger of the two groups' reconstruction errors, each measured relative to that group's best rank-k approximation. Extending the known results for column subset selection to this fair setting is not straightforward: in certain scenarios it is unavoidable to choose columns separately for each group, which doubles the expected column count. We propose a deterministic leverage-score sampling strategy for the fair setting and show that sampling a column subset of minimum size becomes NP-hard in the presence of two groups. Despite these negative results, we give an approximation algorithm that guarantees a solution whose size is within a factor of 1.5 of the optimal solution size. We also present practical heuristic algorithms based on rank-revealing QR factorization. Finally, we validate our methods through an extensive set of experiments on real-world data.
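To make the objective concrete, the following is a minimal sketch of how the min-max cost described above could be evaluated for a candidate column set. It rests on several assumptions that the abstract does not state: the error is measured in the Frobenius norm, each group is projected onto the span of its own selected columns, and the two row groups are passed as separate matrices `A` and `B` with the same columns. The helper names (`best_rank_k_error`, `projection_error`, `fair_cost`, `pivoted_qr_columns`) are hypothetical, and the column-pivoted QR pick on the stacked matrix is only an illustrative baseline, not the paper's heuristic.

```python
import numpy as np
from scipy.linalg import qr


def best_rank_k_error(M, k):
    """Frobenius error of the best rank-k approximation of M (via singular values)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2))


def projection_error(M, cols):
    """Frobenius error of reconstructing M within the span of its columns `cols`."""
    C = M[:, cols]
    residual = M - C @ np.linalg.pinv(C) @ M
    return np.linalg.norm(residual, "fro")


def fair_cost(A, B, cols, k):
    """Larger of the two relative errors (assumed form of the fair objective)."""
    cost_a = projection_error(A, cols) / best_rank_k_error(A, k)
    cost_b = projection_error(B, cols) / best_rank_k_error(B, k)
    return max(cost_a, cost_b)


def pivoted_qr_columns(M, c):
    """Indices of the first c pivots of a column-pivoted QR factorization of M."""
    _, _, piv = qr(M, mode="economic", pivoting=True)
    return piv[:c]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 8))  # rows of the first group (synthetic data)
    B = rng.standard_normal((20, 8))  # rows of the second group (synthetic data)
    k = 3
    # Illustrative baseline only: pick columns from the stacked matrix.
    cols = pivoted_qr_columns(np.vstack([A, B]), k)
    print("selected columns:", cols)
    print("fair cost:", fair_cost(A, B, cols, k))
```

Under these assumptions, any set of exactly k columns yields a cost of at least 1, since the span of k columns cannot beat a group's best rank-k approximation; selecting more than k columns can push the cost below 1.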