We present an evaluation of bucketed approximate top-$k$ algorithms. Computing top-$k$ exactly suffers from limited parallelism, because the $k$ largest values must be aggregated along the vector, and it is therefore not well suited to computation on highly parallel machine learning accelerators. By relaxing the requirement that the top-$k$ be exact, bucketed algorithms can dramatically increase the available parallelism by independently computing many smaller top-$k$ operations. We explore the design choices of this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-$k$ to select the most important parameters or activations. We also release a fast bucketed top-$k$ implementation for PyTorch.
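To make the bucketing idea concrete, the following is a minimal NumPy sketch, not the released PyTorch implementation: the input is split into independent buckets, a small top-$\lceil k/b \rceil$ is computed within each bucket in parallel-friendly fashion, and an exact top-$k$ is then taken over the much smaller candidate set. The function name `bucketed_topk` and its parameters are illustrative assumptions.

```python
import numpy as np

def bucketed_topk(x, k, buckets):
    """Approximate top-k of 1-D array x via independent per-bucket top-k.
    Illustrative sketch only; padding uses -inf so padded slots lose."""
    n = len(x)
    pad = (-n) % buckets
    xp = np.concatenate([x, np.full(pad, -np.inf)])
    xb = xp.reshape(buckets, -1)            # each row is one bucket
    per = -(-k // buckets)                  # ceil(k / buckets) candidates per bucket
    # Independent top-`per` within each bucket (the parallel part).
    idx = np.argpartition(xb, -per, axis=1)[:, -per:]
    cand_vals = np.take_along_axis(xb, idx, axis=1).ravel()
    cand_idx = (idx + np.arange(buckets)[:, None] * xb.shape[1]).ravel()
    # Exact top-k over the small candidate set (size = buckets * per >= k).
    top = np.argpartition(cand_vals, -k)[-k:]
    return cand_vals[top], cand_idx[top]
```

With `buckets=1` this reduces to exact top-$k$; larger bucket counts trade exactness for parallelism, since an element can be missed only when more than $\lceil k/b \rceil$ of the true top-$k$ fall into one bucket.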