We consider the Top-$K$ selection problem, which aims to identify the largest $K$ elements in an array. Top-$K$ selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, Chern et al. (2022) proposed a fast two-stage approximate Top-$K$ algorithm that: (i) partitions the input array into equal-sized chunks and selects the top-$1$ element from each partition; and (ii) sorts the resulting smaller subset and returns the top $K$ elements. In this paper, we generalize the first stage so that each partition selects the top $K'$ elements (for $1 \leq K' \leq K$). Our contributions include: (i) an expression for the expected recall of this generalized algorithm under random partitioning, and a demonstration that choosing $K' > 1$ with fewer partitions in the first stage more effectively reduces the input size to the second stage while maintaining the same expected recall as the original algorithm; (ii) a bound on the expected recall of the original algorithm as a function of the algorithm parameters that is provably tighter by a factor of $2$ than the bound reported by Chern et al. (2022); and (iii) an implementation of our algorithm on Cloud TPUv5e that achieves approximately an order of magnitude speedup over the original algorithm without sacrificing recall.
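As an illustration of the generalized two-stage procedure described above, the following is a minimal NumPy sketch, not the TPU implementation reported in the paper. The function names `approx_top_k` and `recall`, the parameter names `num_partitions` and `k_prime`, and the use of contiguous equal-sized chunks are assumptions made for this example. Shuffling the input beforehand would correspond to the random-partitioning setting under which expected recall is analyzed.

```python
import numpy as np


def approx_top_k(x, k, num_partitions, k_prime=1):
    """Illustrative two-stage approximate top-k (hypothetical helper).

    Stage 1: split x into equal-sized chunks and keep the top k_prime of each.
    Stage 2: sort the surviving candidates and return the k largest.
    Returns (values, indices), with values in descending order.
    """
    x = np.asarray(x)
    n = x.shape[0]
    if n % num_partitions != 0:
        raise ValueError("len(x) must be divisible by num_partitions")
    chunk = n // num_partitions
    if not 1 <= k_prime <= min(k, chunk):
        raise ValueError("need 1 <= k_prime <= min(k, chunk size)")
    if num_partitions * k_prime < k:
        raise ValueError("stage 1 must keep at least k candidates")

    # Stage 1: per-chunk top-k_prime via a partial (unsorted) selection.
    chunks = x.reshape(num_partitions, chunk)
    kth = chunk - k_prime
    local_idx = np.argpartition(chunks, kth, axis=1)[:, kth:]
    # Convert chunk-local positions to positions in the original array.
    offsets = np.arange(num_partitions)[:, None] * chunk
    cand_idx = (local_idx + offsets).ravel()

    # Stage 2: exact top-k over the num_partitions * k_prime candidates.
    order = np.argsort(x[cand_idx])[::-1][:k]
    top_idx = cand_idx[order]
    return x[top_idx], top_idx


def recall(x, approx_idx, k):
    """Fraction of the exact top-k indices recovered (assumes distinct values)."""
    exact_idx = np.argsort(np.asarray(x))[::-1][:k]
    return len(np.intersect1d(exact_idx, approx_idx)) / k
```

In this sketch the second stage sorts `num_partitions * k_prime` candidates, so the two parameters can be traded against each other: `k_prime=1` recovers the original scheme of Chern et al. (2022), while a larger `k_prime` with fewer partitions shrinks the second-stage input. Measuring `recall` on shuffled inputs for different settings is one way to explore that trade-off empirically.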