Selecting a subset of the $k$ "best" items from a dataset of $n$ items according to a scoring function is a key task in decision-making. Given the widespread use of automated decision-making software today, it is important that the outcome of this process, called top-$k$ selection, be fair. Here we consider the problem of identifying a fair linear scoring function for top-$k$ selection. The function computes a score for each item as a weighted sum of its (numerical) attribute values, and it must ensure that the selected subset faithfully represents a minority or historically disadvantaged group within the entire dataset. Existing algorithms do not scale effectively on large, high-dimensional datasets. Our theoretical analysis shows that, in more than two dimensions, no algorithm is likely to achieve good scalability with respect to dataset size (i.e., a running time of $O(n\cdot \text{polylog}(n))$), and that the computational complexity is likely to increase rapidly with dimensionality. Small values of $k$ are an exception, however, and for this case we provide significantly faster algorithms, along with efficient practical variants whose implementations take advantage of modern hardware (e.g., by exploiting parallelism). For large values of $k$, we give an alternative algorithm that, while theoretically inferior, performs better in practice. Experimental results on real-world datasets demonstrate the efficiency of our algorithms, which achieve speed-ups of up to several orders of magnitude over the state of the art (SoTA).
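To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of top-$k$ selection under a linear scoring function, together with a simple proportional-representation check for a protected group. All names, the data, and the fairness bound are illustrative assumptions.

```python
def top_k(items, weights, k):
    """Score each item as the weighted sum of its attribute values
    and return the k highest-scoring items."""
    scored = sorted(
        items,
        key=lambda it: sum(w * x for w, x in zip(weights, it["attrs"])),
        reverse=True,
    )
    return scored[:k]


def is_fair(selected, items, group, lower):
    """Illustrative fairness check: the protected group's share of the
    selection must be at least `lower` times its share of the dataset."""
    share_all = sum(it["group"] == group for it in items) / len(items)
    share_sel = sum(it["group"] == group for it in selected) / len(selected)
    return share_sel >= lower * share_all


# Toy dataset: two numerical attributes per item, plus a group label.
items = [
    {"attrs": (0.9, 0.2), "group": "A"},
    {"attrs": (0.4, 0.8), "group": "B"},
    {"attrs": (0.7, 0.5), "group": "A"},
    {"attrs": (0.3, 0.9), "group": "B"},
]

selected = top_k(items, weights=(0.5, 0.5), k=2)
print(is_fair(selected, items, group="B", lower=0.5))  # → True
```

The problem studied in the abstract is the converse of this check: rather than verifying one fixed weight vector, it asks for weights under which the resulting top-$k$ set satisfies the representation constraint.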