Cloud database systems, particularly their middleware and query execution layers, rely on sorting as a core operation in query processing, indexing, and join execution. Radix sort is preferred for large datasets because it outperforms comparison-based algorithms, but state-of-the-art implementations suffer from two inherent problems: distribution dependence and limited parallelism. Common remedies such as multi-pass bucketing, stochastic sampling, and dependence graph structures incur data pre-processing costs and a larger memory footprint, which makes them less suitable for the large-scale workloads typical of cloud environments. In-place radix sort schemes require more passes as key precision increases, which raises latency. We address these problems with a CPU-adapted histogram compression scheme for radix sorting of arbitrary-precision keys. Implementing it on the CPU improves accessibility, and the scheme achieves state-of-the-art execution time while limiting histogram growth. Fully parallel key-based histogram updates eliminate input bucketing and data pre-processing, which further lowers latency, mitigates distribution dependence, and reduces complexity. Built on a parallelized sorting architecture that uses SIMD-accelerated operations for low latency, the algorithm improves bandwidth efficiency over the state of the art on the CPU, GPU, and FPGA by 6x, 3x, and 2.5x, respectively, on 512 MB to 32 GB datasets at 16-bit precision.
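For orientation, the following is a minimal sketch of the conventional histogram-based LSD radix sort that the abstract treats as its baseline. It is not the compressed-histogram scheme proposed in this work, whose details the abstract does not give. The function names and the parameters `key_bits` and `radix_bits` are illustrative assumptions, and the sketch is sequential, with no SIMD or multithreading.

```python
from itertools import accumulate


def counting_pass(keys, shift, radix_bits):
    """One stable pass: build a histogram, prefix-sum it, then scatter."""
    buckets = 1 << radix_bits          # histogram size grows as 2**radix_bits
    mask = buckets - 1

    # Histogram: count how many keys fall into each digit value.
    hist = [0] * buckets
    for k in keys:
        hist[(k >> shift) & mask] += 1

    # Exclusive prefix sum turns counts into starting output offsets.
    offsets = [0] + list(accumulate(hist))[:-1]

    # Scatter keys to their slots; input order is preserved within a bucket.
    out = [0] * len(keys)
    for k in keys:
        d = (k >> shift) & mask
        out[offsets[d]] = k
        offsets[d] += 1
    return out


def radix_sort(keys, key_bits=16, radix_bits=8):
    """LSD radix sort of non-negative integers narrower than key_bits."""
    passes = -(-key_bits // radix_bits)  # ceiling division
    for p in range(passes):
        keys = counting_pass(keys, p * radix_bits, radix_bits)
    return keys
```

With `radix_bits=8`, a 16-bit key needs two passes over a 256-entry histogram; with `radix_bits=16`, it needs a single pass but a 65,536-entry histogram. This is the trade-off between pass count and histogram growth that the abstract refers to.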