This paper presents efficient algorithms designed to leverage SIMD for performing Montgomery reductions and additions on integers larger than 512 bits. Existing algorithms are inefficient when parallelized with SIMD because both operations contain extensive dependencies; the inefficiency is particularly noticeable on instruction sets where the required operations are costly, such as ARM's SVE. To mitigate this problem, a novel addition algorithm is introduced that simulates the addition of large integers using a smaller addition, quickly producing the same set of carries. These carries are then used to perform parallel additions on the large integers. For Montgomery reductions, serial multiplications are replaced with precomputations that can be computed efficiently using SIMD extensions. Experiments show that the proposed algorithms substantially improve the performance of state-of-the-art implementations of several post-quantum cryptography algorithms. Notably, they deliver a 30% speed-up over the latest CTIDH implementation, an 11% speed-up over the latest CSIDH implementation on AVX-512 processors, and a 7% speed-up over Microsoft's standard PQCrypto-SIDH for SIKEp503 on A64FX.
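The idea of recovering all carries of a large addition from one smaller addition can be illustrated with a generic generate/propagate mask technique. The sketch below is an assumption about how such a scheme might look, not the algorithm proposed in the paper: it assumes 64-bit little-endian limbs, the names `to_limbs`, `from_limbs`, and `add_limbs` are hypothetical, and plain Python loops stand in for SIMD lanes.

```python
MASK64 = (1 << 64) - 1  # assumed limb width: 64 bits


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def add_limbs(a, b):
    """Add two equal-length limb vectors; return (sum_limbs, carry_out)."""
    n = len(a)

    # Step 1: per-limb sums with no carry chain (one SIMD add in practice).
    partial = [(x + y) & MASK64 for x, y in zip(a, b)]

    # Step 2: one bit per limb.
    #   gen  bit i: limb i overflowed on its own (it generates a carry).
    #   prop bit i: limb i is all ones (it forwards an incoming carry).
    gen = 0
    prop = 0
    for i in range(n):
        if partial[i] < a[i]:
            gen |= 1 << i
        if partial[i] == MASK64:
            prop |= 1 << i

    # Step 3: a single small addition on the n-bit masks reproduces the
    # whole carry chain. Bit i of carry_in is the carry entering limb i.
    carry_in = ((gen << 1) + prop) ^ prop

    # Step 4: apply all carries independently (again one SIMD add).
    out = [(partial[i] + ((carry_in >> i) & 1)) & MASK64 for i in range(n)]
    return out, (carry_in >> n) & 1
```

For example, `add_limbs(to_limbs((1 << 512) - 1, 8), to_limbs(1, 8))` yields eight zero limbs and a carry-out of 1: the lowest limb generates a carry, the other seven propagate it, and the single mask addition in Step 3 resolves the entire ripple at once.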