Selecting a subset of the $k$ "best" items from a dataset of $n$ items, based on a scoring function, is a key task in decision-making. Given the rise of automated decision-making software, it is important that the outcome of this process, called top-$k$ selection, is fair. Here we consider the problem of identifying a fair linear scoring function for top-$k$ selection. The function computes a score for each item as a weighted sum of its (numerical) attribute values, and must ensure that the selected subset includes adequate representation of a minority or historically disadvantaged group. Existing algorithms do not scale efficiently, particularly in higher dimensions. Our hardness analysis shows that in more than two dimensions, no algorithm is likely to achieve good scalability with respect to dataset size, and the computational complexity is likely to increase rapidly with dimensionality. However, the hardness results also provide key insights guiding algorithm design, leading to our two-pronged solution: (1) For small values of $k$, our hardness analysis reveals a gap in the hardness barrier. By addressing various engineering challenges, including achieving efficient parallelism, we turn this potential for efficiency into an optimized algorithm delivering substantial practical performance gains. (2) For large values of $k$, where the hardness is robust, we employ a practically efficient algorithm which, despite having worse theoretical guarantees, achieves superior real-world performance. Experimental evaluations on real-world datasets then explore scenarios where worst-case behavior does not manifest, identifying areas critical to practical performance. Our solution achieves speed-ups of up to several orders of magnitude over the state of the art (SOTA), an efficiency made possible through a tight integration of hardness analysis, algorithm design, practical engineering, and empirical evaluation.
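To make the problem setting concrete, the following is a minimal sketch of the selection criterion the abstract describes: a linear scoring function (a weighted sum of numerical attributes) selects the top-$k$ items, and a representation constraint checks whether enough selected items belong to the protected group. All names, the toy data, and the threshold `min_protected` are illustrative assumptions; this sketches only the fairness check for a single candidate weight vector, not the paper's algorithm for searching the space of fair weight vectors.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest-scoring items."""
    return np.argsort(scores)[::-1][:k]

def is_fair(selected, group, min_protected):
    """Representation constraint: at least min_protected of the
    selected items must belong to the protected group (label 1)."""
    return int(np.sum(group[selected])) >= min_protected

# Toy data (illustrative): 100 items, 3 numerical attributes each,
# and a binary group label where 1 marks the protected group.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
group = rng.integers(0, 2, size=100)

# One candidate weight vector; the paper's task is to find weights
# for which the induced top-k selection satisfies the constraint.
w = np.array([0.5, 0.3, 0.2])
scores = X @ w                      # linear score: weighted attribute sum
selected = top_k(scores, k=10)
print(is_fair(selected, group, min_protected=3))
```

Under this formulation, the search difficulty the abstract refers to comes from the fact that the top-$k$ set changes combinatorially as the weight vector $w$ varies.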