GPU Implementations for Midsize Integer Addition and Multiplication

This paper explores practical aspects of using a high-level functional language for GPU-based arithmetic on ``midsize'' integers, by which we mean integers of up to about a quarter million bits, a range sufficient for most practical purposes. The goal is to understand whether efficient nested-parallel programs can be supported with a small, flexible code base. We report on GPU implementations of addition and multiplication for integers that fit in one CUDA block, thus leveraging temporal reuse from scratchpad memory. Our key contribution lies in the simplicity of the proposed solutions. We recognize that addition is a straightforward application of scan, which is known to admit efficient GPU implementation. For quadratic multiplication, we employ a simple work-partitioning strategy that offers good temporal locality. For FFT multiplication, we efficiently map the computation to the domain of integral fields by finding ``good'' primes that enable almost-full utilization of machine words. In comparison, related work either uses complex tiling strategies, which feel like too big a hammer for the job, or computes over the reals, which may reduce the magnitude of the base in which the computation is carried out. We evaluate performance against the state-of-the-art CGBN library, authored by NvidiaLab, and report that our CUDA prototype outperforms CGBN for integer sizes above 32K bits while offering comparable performance for smaller sizes. Moreover, we are, to our knowledge, the first to report that FFT multiplication outperforms classical multiplication on the larger sizes that still fit in a CUDA block. Finally, we examine Futhark's strengths and weaknesses in efficiently supporting such computations and find that a compiler pass aimed at efficient sequentialization of excess parallelism would significantly improve performance.
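The observation that addition reduces to a scan can be illustrated with a small sequential sketch. This is not the implementation evaluated in the paper: the 32-bit limb width, the little-endian limb order, and all names (`carry_status`, `combine`, `add_limbs`, `to_limbs`, `from_limbs`) are assumptions made for illustration. Each limb pair is classified as killing, propagating, or generating a carry, and an inclusive scan with an associative operator resolves the carries. Here `itertools.accumulate` stands in for the scan; on a GPU a parallel scan over the same operator would presumably take its place.

```python
from itertools import accumulate

BITS = 32                      # limb width: an assumption of this sketch
BASE = 1 << BITS

KILL, PROPAGATE, GENERATE = 0, 1, 2

def carry_status(x, y):
    """Classify one limb pair by what it does to an incoming carry."""
    s = x + y
    if s >= BASE:
        return GENERATE        # produces a carry regardless of carry-in
    if s == BASE - 1:
        return PROPAGATE       # passes an incoming carry through
    return KILL                # absorbs any incoming carry

def combine(lo, hi):
    """Associative operator: the higher limb decides unless it only propagates."""
    return lo if hi == PROPAGATE else hi

def add_limbs(a, b):
    """Add two equal-length little-endian limb lists; return (limbs, carry_out)."""
    assert len(a) == len(b)
    status = [carry_status(x, y) for x, y in zip(a, b)]            # map
    scanned = list(accumulate(status, combine))                    # inclusive scan
    # Carry into limb i is the resolved status of limbs 0..i-1 (none for limb 0).
    carry_in = [0] + [int(s == GENERATE) for s in scanned[:-1]]
    limbs = [(x + y + c) % BASE for x, y, c in zip(a, b, carry_in)]
    carry_out = int(scanned[-1] == GENERATE) if scanned else 0
    return limbs, carry_out

def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (BITS * i)) & (BASE - 1) for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from little-endian limbs."""
    return sum(l << (BITS * i) for i, l in enumerate(limbs))
```

For example, `add_limbs(to_limbs(2**128 - 1, 4), to_limbs(1, 4))` ripples a carry through all four limbs and returns four zero limbs with a carry-out of 1. Because `combine` is associative, the scan step does not depend on left-to-right evaluation, which is what makes the formulation amenable to parallel execution.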
