FractalSortCPU: Bandwidth-Efficient Compressed Radix Sort on CPU

Cloud database systems, particularly their middleware and query execution layers, rely on sorting as a core operation in query processing, indexing, and join execution. Radix sort is preferred for large datasets because it outperforms comparison-based algorithms, but state-of-the-art implementations suffer from two inherent problems: distribution dependence and limited parallelism. Common remedies such as multi-pass bucketing, stochastic sampling, and dependence graph structures add data pre-processing cost and increase the memory footprint, which makes them less appropriate for the large-scale workloads common in cloud environments. In-place radix sort schemes need more passes as key precision increases, which hurts latency. Our work addresses these problems with a CPU-adapted histogram compression scheme for radix sorting of arbitrary-precision keys. Implementing it on the CPU increases accessibility, and the scheme provides state-of-the-art execution time while limiting histogram growth. Fully parallel key-based histogram updates eliminate the need for input bucketing and data pre-processing, which further lowers latency, mitigates distribution dependence, and reduces complexity. With a parallelized sorting architecture that uses SIMD-accelerated operations for low latency, the algorithm improves bandwidth efficiency over the state of the art on CPU, GPU, and FPGA by 6x, 3x, and 2.5x, respectively, on 512 MB to 32 GB datasets at 16-bit precision.
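To make the idea of key-based histogram updates without input bucketing concrete, the following is a minimal sketch of a plain histogram (counting) sort for unsigned 16-bit keys. It is not the paper's compression scheme, which the abstract does not specify. The function name `histogram_sort`, the `chunks` parameter, and the use of per-chunk private histograms to stand in for parallel workers are all assumptions made for illustration.

```python
from itertools import accumulate

KEY_BITS = 16  # key precision quoted in the abstract


def histogram_sort(keys, chunks=4):
    """Sort unsigned 16-bit integers using one histogram per chunk.

    Illustrative only: an uncompressed histogram, with chunks processed
    sequentially here where a real implementation would use threads/SIMD.
    """
    bins = 1 << KEY_BITS
    step = max(1, -(-len(keys) // chunks))  # ceiling division

    # 1. Each chunk updates its own private histogram, indexed directly by key.
    partials = []
    for start in range(0, len(keys), step):
        hist = [0] * bins
        for k in keys[start:start + step]:
            hist[k] += 1
        partials.append(hist)

    # 2. Merge the private histograms bin by bin.
    total = [sum(col) for col in zip(*partials)] if partials else [0] * bins

    # 3. An exclusive prefix sum turns counts into output offsets.
    offsets = [0] + list(accumulate(total))[:-1]

    # 4. Scatter every key to its final position.
    out = [0] * len(keys)
    for k in keys:
        out[offsets[k]] = k
        offsets[k] += 1
    return out
```

Note that this uncompressed histogram needs 2^16 counters per chunk, and the count doubles with every extra bit of key precision; that growth is the cost the abstract says the compression scheme is meant to limit.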
