The prefix sum operation is a useful primitive with a broad range of applications. For database systems, it is a building block of many important operators including join, sort and filter queries. In this paper, we study different methods of computing prefix sums with SIMD instructions and multiple threads. For SIMD, we implement and compare horizontal and vertical computations, as well as a theoretically work-efficient balanced tree version using gather/scatter instructions. With multithreading, memory bandwidth can become the bottleneck of prefix sum computations. We propose a new method that splits the data into cache-sized partitions to achieve better data locality and reduce the bandwidth demand on RAM. We also investigate four different ways of organizing the computation sub-procedures, which have different performance and usability characteristics. In our experiments, we find that the most efficient prefix sum computation using our partitioning technique is up to 3x faster than two standard library implementations that already use SIMD and multithreading.
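To make the partitioning idea concrete, the following is a minimal sketch of a two-pass prefix sum applied one cache-sized partition at a time. It is not the implementation evaluated in the paper: the function name `partitioned_prefix_sum` and the parameters `partition_elems` and `workers` are assumptions made for illustration, the worker loops run sequentially here (each iteration is independent and could be assigned to its own thread), and the inner loops are scalar rather than SIMD. The point it illustrates is that both passes touch a partition while it is still likely to be cache-resident, so the second pass need not re-read the data from RAM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the paper): in-place inclusive prefix sum,
// computed one cache-sized partition at a time.
// `partition_elems` and `workers` are illustrative tuning parameters.
void partitioned_prefix_sum(std::vector<std::int64_t>& data,
                            std::size_t partition_elems,
                            std::size_t workers) {
    assert(partition_elems > 0 && workers > 0);
    std::int64_t carry = 0;  // sum of all elements in earlier partitions
    std::vector<std::int64_t> chunk_sum(workers);

    for (std::size_t begin = 0; begin < data.size(); begin += partition_elems) {
        const std::size_t end = std::min(begin + partition_elems, data.size());
        const std::size_t chunk = (end - begin + workers - 1) / workers;

        // Pass 1: each worker sums its own chunk of the partition.
        // Iterations are independent, so each could run on a separate thread.
        for (std::size_t w = 0; w < workers; ++w) {
            const std::size_t lo = std::min(begin + w * chunk, end);
            const std::size_t hi = std::min(lo + chunk, end);
            std::int64_t s = 0;
            for (std::size_t i = lo; i < hi; ++i) s += data[i];
            chunk_sum[w] = s;
        }

        // Sequential step: convert chunk sums into per-worker start offsets.
        for (std::size_t w = 0; w < workers; ++w) {
            const std::int64_t s = chunk_sum[w];
            chunk_sum[w] = carry;
            carry += s;
        }

        // Pass 2: each worker rewrites its chunk as running sums, starting
        // from its offset. The partition was just read in pass 1, so it is
        // likely still in cache.
        for (std::size_t w = 0; w < workers; ++w) {
            const std::size_t lo = std::min(begin + w * chunk, end);
            const std::size_t hi = std::min(lo + chunk, end);
            std::int64_t running = chunk_sum[w];
            for (std::size_t i = lo; i < hi; ++i) {
                running += data[i];
                data[i] = running;
            }
        }
    }
}
```

In this sketch, `partition_elems` would be chosen so that one partition fits in the cache level shared by the workers; the right value depends on the machine and is left as an assumption here.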