A Predictive Framework for Base-n Radix Sort Optimization

Sorting is a foundational primitive of computer science, and optimizations in sorting subroutines can cascade into significant performance gains for high-throughput systems. In this paper, we analyze the inefficiencies of a non-comparison sorting algorithm, Base-n Radix Sort (BNRS), and specifically the "zero padding" problem in skewed datasets. We develop an execution model called Stable Partitioning - Least Significant Digit Radix Sort (SP-LSD for short), an iterative pruning model based on the least significant digit and designed to address this inefficiency. From this model we derive the Radix Crossover Framework (RCF), an analytic three-point decision framework. The framework assumes non-negative integer keys, a precondition that enables the derivation of three critical boundaries. First, the Asymptotic Crossover ($k<n^{\log_2 n}$) defines when BNRS and SP-LSD can theoretically outperform comparison sorting algorithms, where $k$ is the maximum value and $n$ is the input size. Second, the Round-feasibility Crossover ($k>n^2$) defines when the overhead cost of the implemented SP-LSD model is amortized. Third, we derive the Pruning Crossover, parameterized by the ratio of random-access sorting cost to sequential partitioning cost. The model demonstrates that SP-LSD yields a net gain over standard BNRS on both skewed and uniform distributions. The experimental results are consistent with the crossover boundaries, providing a deterministic roadmap for adaptive algorithm selection.
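The first two boundaries can be illustrated with a minimal Python sketch. It implements a plain base-n LSD radix sort, not SP-LSD, whose pruning step the abstract does not specify. The function names `base_n_radix_sort`, `below_asymptotic_crossover`, and `above_round_feasibility_crossover` are hypothetical, introduced here for illustration only. The asymptotic check assumes the usual cost model in which base-n radix sort performs about $\log_n k$ linear passes, so it beats $n \log_2 n$ comparisons when $\log_2 k < (\log_2 n)^2$, which is the same condition as $k<n^{\log_2 n}$.

```python
import math


def base_n_radix_sort(values):
    """Plain LSD radix sort using base n = len(values).

    Assumes non-negative integers (the precondition stated in the abstract).
    Returns the sorted list and the number of digit rounds performed.
    """
    n = len(values)
    if n < 2:
        return list(values), 0
    base = n
    k = max(values)
    out = list(values)
    rounds = 0
    place = 1
    while k // place > 0:
        # Stable distribution into one bucket per digit value.
        buckets = [[] for _ in range(base)]
        for v in out:
            buckets[(v // place) % base].append(v)
        out = [v for bucket in buckets for v in bucket]
        place *= base
        rounds += 1
    return out, rounds


def below_asymptotic_crossover(n, k):
    """True when k < n^(log2 n), tested in log space to avoid huge powers."""
    if k < 1:
        return True
    if n < 2:
        return False
    return math.log2(k) < math.log2(n) ** 2


def above_round_feasibility_crossover(n, k):
    """True when k > n^2, the round-feasibility condition from the abstract."""
    return k > n * n


data = [170, 45, 75, 90, 802, 24, 2, 66]
sorted_data, rounds = base_n_radix_sort(data)
print(sorted_data, rounds)
print(below_asymptotic_crossover(len(data), max(data)))
print(above_round_feasibility_crossover(len(data), max(data)))
```

For the sample input, $n = 8$ and $k = 802$, so the sort runs in base 8. Comparing logarithms rather than computing $n^{\log_2 n}$ directly keeps the check cheap and avoids very large intermediate values for big inputs.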

