CRYSTALS-Dilithium is a lattice-based signature scheme to be standardized by NIST as the primary post-quantum signature algorithm. In this work, we present a thorough study of optimizing Dilithium implementations with the Advanced Vector Extensions (AVX) instructions, specifically AVX2 and the latest AVX-512. We first present an improved parallel small polynomial multiplication with tailored early evaluation (PSPM-TEE) to further speed up the signing procedure. Our PSPM algorithm outperforms the NTT by 47%-66% in the AVX2 and AVX-512 implementations. We then present a tailored reduction method that is simpler and faster than Montgomery reduction. We minimize the CPU cycles of the AVX-512 implementation of tailored reduction by using AVX-512IFMA. Finally, we propose a fully and highly vectorized implementation of Dilithium using AVX-512. This is achieved by carefully vectorizing most of Dilithium's functions with AVX-512 instructions in order to improve both time and space efficiency. With all these optimizations, our AVX-512 implementation improves performance by 43.2%/39.3%/45.6% in key generation, 36.6%/41.6%/43.7% in signing, and 45.3%/46.5%/47.4% in verification for the Dilithium2/3/5 parameter sets, respectively. To the best of our knowledge, our AVX-512 implementation has the best performance for Dilithium on the Intel x86-64 CPU platform to date.
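To make the idea of small polynomial multiplication concrete, the following is a minimal scalar sketch of the operation that such methods target. It is not the PSPM-TEE algorithm of this work. In Dilithium signing, the challenge polynomial c has coefficients in {-1, 0, 1} and the secret polynomials have small coefficients, so the product in Z[x]/(x^n + 1) can be formed with additions and subtractions alone, without an NTT. The function name `small_poly_mul` and the toy degree n = 4 are illustrative assumptions; Dilithium uses n = 256.

```python
def small_poly_mul(c, s):
    """Multiply c by s in Z[x]/(x^n + 1), where c has coefficients in {-1, 0, 1}.

    Illustrative sketch only: a plain index-based (schoolbook) product that
    exploits the sparsity and smallness of c. No modular reduction is applied,
    because the result stays small when both inputs are small.
    """
    n = len(s)
    if len(c) != n:
        raise ValueError("c and s must have the same length")
    r = [0] * n
    for i, ci in enumerate(c):
        if ci == 0:
            continue  # zero coefficients of c contribute nothing
        for j, sj in enumerate(s):
            k = i + j
            if k < n:
                r[k] += ci * sj      # ordinary term
            else:
                r[k - n] -= ci * sj  # wrap-around: x^n = -1
    return r


if __name__ == "__main__":
    # Toy example with n = 4: (1 - x^3) * (1 + 2x + 3x^2 + 4x^3)
    print(small_poly_mul([1, 0, 0, -1], [1, 2, 3, 4]))
```

Because each nonzero coefficient of c only adds or subtracts a shifted copy of s, the inner loop is a natural candidate for vectorization, which is the property that parallel small polynomial multiplication builds on.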