CRYSTALS-Dilithium is a lattice-based signature scheme to be standardized by NIST as the primary post-quantum signature algorithm. In this work, we present a thorough study of optimizing Dilithium implementations using the Advanced Vector Extensions (AVX) instruction sets, specifically AVX2 and the more recent AVX-512. We first present an improved parallel small polynomial multiplication with tailored early evaluation (PSPM-TEE) to further speed up the signing procedure, which yields a speedup of 5%-6% over the original PSPM Dilithium implementation. We then present a tailored reduction method that is simpler and faster than Montgomery reduction. Our optimized AVX2 implementation achieves a speedup of 3%-8% over the state-of-the-art Dilithium AVX2 software. Finally, we propose, for the first time, a fully and highly vectorized implementation of Dilithium using AVX-512. This is achieved by carefully vectorizing most of Dilithium's functions with AVX-512 instructions to improve efficiency in both time and space. With all of these optimizations, our AVX-512 implementation improves performance by 37.3%/50.7%/39.7% in key generation, 34.1%/37.1%/42.7% in signing, and 38.1%/38.7%/40.7% in verification for the Dilithium2/3/5 parameter sets, respectively. To the best of our knowledge, our AVX-512 implementation has the best performance for Dilithium on the Intel x64 CPU platform to date.
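For context on the baseline that the tailored reduction is compared against, the following is a minimal scalar sketch of signed Montgomery reduction for the Dilithium modulus q = 8380417 with R = 2^32. It is not the tailored reduction method proposed here, whose details the abstract does not give, and it is not vectorized. The helper `to_int32` and the function names are illustrative choices, and Python integers are used to emulate 32-bit wraparound.

```python
Q = 8380417        # Dilithium modulus q = 2^23 - 2^13 + 1
QINV = 58728449    # q^(-1) mod 2^32
R_BITS = 32        # Montgomery radix R = 2^32


def to_int32(x):
    """Wrap an arbitrary Python integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def montgomery_reduce(a):
    """Return r with r * 2^32 congruent to a (mod Q).

    For |a| <= 2^31 * Q the result satisfies -Q < r < Q.
    """
    # t = a * q^(-1) mod 2^32, interpreted as a signed 32-bit integer
    t = to_int32(a * QINV)
    # a - t*Q is divisible by 2^32, so the shift is an exact division
    return (a - t * Q) >> R_BITS


def mont_mul(x, y):
    """Multiply two residues; the result carries an extra factor of R^(-1)."""
    return montgomery_reduce(x * y)
```

As a usage note, a product computed with `mont_mul` is scaled by R^(-1) modulo q, so one operand is usually kept in Montgomery form (pre-multiplied by R modulo q) to cancel that factor.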