This paper presents software implementations of batch computations on multi-precision integers. We use the Single Instruction Multiple Data (SIMD) AVX512 instruction set of x86-64 processors, in particular the vectorized fused multiply-add instruction VPMADD52. We focus on batch multiplications, squarings, modular multiplications, modular squarings and constant-time modular exponentiations of 8 values stored in a word-sliced layout. We explore Schoolbook and Karatsuba approaches with operands of up to 4108 and 4154 bits, respectively. We also introduce a truncated multiplication that speeds up Montgomery modular reduction in software. Our Truncated Montgomery modular multiplication is almost 20% faster than the conventional non-truncated versions. Compared to the state-of-the-art GMP and OpenSSL libraries, our modular operations are more than 4 times faster. Compared to OpenSSL's BN_mod_exp_mont_consttime_x2, which uses the AVX512 madd52* instructions (madd52hi or madd52lo) on 256-bit registers, our 512-bit implementation of fixed-window exponentiation achieves speedups of 1.75 and 1.38 for 1024-bit and 2048-bit sizes, respectively, while our 256-bit version (a batch of 4 values in this case) achieves speedups of 1.51 and 1.05.
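To make the word-sliced layout and the role of the two VPMADD52 halves concrete, the following is a minimal Python sketch of a batch Schoolbook multiplication. It is a scalar emulation under stated assumptions, not the paper's AVX512 code: the helper names (`madd52lo`, `madd52hi`, `to_word_sliced`, `from_word_sliced`, `batch_mul_schoolbook`) are hypothetical, each "register" is modelled as a Python list of 8 lanes, and the operands are assumed small enough that the 64-bit accumulators never overflow before carry propagation.

```python
LANES = 8                    # a 512-bit register holds eight 64-bit lanes
RADIX = 52                   # VPMADD52 multiplies 52-bit limbs
MASK52 = (1 << RADIX) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """Lane-wise: acc + low 52 bits of the 104-bit product a*b (mod 2^64)."""
    return [(x + (((y & MASK52) * (z & MASK52)) & MASK52)) & MASK64
            for x, y, z in zip(acc, a, b)]


def madd52hi(acc, a, b):
    """Lane-wise: acc + high 52 bits of the 104-bit product a*b (mod 2^64)."""
    return [(x + (((y & MASK52) * (z & MASK52)) >> RADIX)) & MASK64
            for x, y, z in zip(acc, a, b)]


def to_word_sliced(values, n_limbs):
    """Row i holds limb i of every batch value, one value per lane."""
    return [[(v >> (RADIX * i)) & MASK52 for v in values]
            for i in range(n_limbs)]


def from_word_sliced(limbs):
    """Inverse of to_word_sliced: rebuild one integer per lane."""
    return [sum(row[k] << (RADIX * i) for i, row in enumerate(limbs))
            for k in range(len(limbs[0]))]


def batch_mul_schoolbook(a, b):
    """Multiply 8 operand pairs at once; a and b are word-sliced, n limbs each."""
    n = len(a)
    acc = [[0] * LANES for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            # The low half of a[i]*b[j] lands in column i+j, the high half in i+j+1.
            acc[i + j] = madd52lo(acc[i + j], a[i], b[j])
            acc[i + j + 1] = madd52hi(acc[i + j + 1], a[i], b[j])
    # Deferred carry propagation: bring every column back to 52 bits.
    carry = [0] * LANES
    out = []
    for row in acc:
        s = [x + c for x, c in zip(row, carry)]
        out.append([x & MASK52 for x in s])
        carry = [x >> RADIX for x in s]
    return out
```

For example, eight 1024-bit operands fit in 20 limbs of 52 bits each, and `from_word_sliced(batch_mul_schoolbook(to_word_sliced(xs, 20), to_word_sliced(ys, 20)))` returns the eight products. Deferring carries is safe in this sketch because each column accumulates at most 2n terms below 2^52, which stays far below 2^64 for these sizes.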