CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the NIST standardization process. This paper addresses the optimization of the Kyber architecture under latency and throughput constraints. Matrix-vector multiplication and number theoretic transform (NTT)-based polynomial multiplication are the critical operations and the main bottlenecks. We propose an algorithm and hardware co-design approach that systematically optimizes both operations by employing a novel sub-structure sharing technique, which reduces computational complexity, i.e., the number of modular multiplications and modular additions/subtractions. The sub-structure sharing approach is inspired by prior fast parallel approaches based on polyphase decomposition. The proposed feed-forward architecture achieves high speed, low latency, and full utilization of all hardware components, which can significantly improve the overall efficiency of the Kyber scheme. FPGA implementation results show that the proposed design, using the fast two-parallel structure, reduces execution time by approximately 90% and improves throughput by a factor of 66.
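The polyphase idea behind a fast two-parallel structure can be illustrated in software. The sketch below is not the paper's architecture: it assumes the Kyber ring Z_q[x]/(x^n + 1) with q = 3329, uses a plain schoolbook product in place of the NTT, and all function names (`negacyclic_mul`, `mul_by_y`, `two_parallel_mul`) are hypothetical. It splits each polynomial into even and odd phases and shares the sub-products A0*B0 and A1*B1, so that three half-size products are needed instead of four.

```python
import random

Q = 3329  # Kyber modulus (assumed here for illustration)


def negacyclic_mul(a, b, q=Q):
    """Schoolbook product of a and b in Z_q[x]/(x^n + 1), n = len(a)."""
    n = len(a)
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q
            else:
                # x^n = -1, so wrapped-around terms are subtracted
                c[k - n] = (c[k - n] - ai * bj) % q
    return c


def mul_by_y(p, q=Q):
    """Multiply p by y in Z_q[y]/(y^m + 1): a negacyclic shift by one."""
    return [(-p[-1]) % q] + p[:-1]


def two_parallel_mul(a, b, q=Q):
    """Two-parallel product via polyphase decomposition, with y = x^2.

    A(x) = A0(y) + x*A1(y) and B(x) = B0(y) + x*B1(y), so
      C0(y) = A0*B0 + y*A1*B1
      C1(y) = (A0 + A1)*(B0 + B1) - A0*B0 - A1*B1
    The products A0*B0 and A1*B1 are computed once and reused.
    """
    a0, a1 = a[0::2], a[1::2]
    b0, b1 = b[0::2], b[1::2]

    p00 = negacyclic_mul(a0, b0, q)  # shared sub-product
    p11 = negacyclic_mul(a1, b1, q)  # shared sub-product
    a_sum = [(u + v) % q for u, v in zip(a0, a1)]
    b_sum = [(u + v) % q for u, v in zip(b0, b1)]
    p_sum = negacyclic_mul(a_sum, b_sum, q)

    c0 = [(u + v) % q for u, v in zip(p00, mul_by_y(p11, q))]
    c1 = [(w - u - v) % q for w, u, v in zip(p_sum, p00, p11)]

    # Interleave the even and odd phases back into one polynomial
    c = [0] * len(a)
    c[0::2] = c0
    c[1::2] = c1
    return c


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.randrange(Q) for _ in range(256)]
    b = [rng.randrange(Q) for _ in range(256)]
    print(two_parallel_mul(a, b) == negacyclic_mul(a, b))
```

In a hardware setting the three half-size products would presumably map to parallel multiplier units, with the extra additions and subtractions as the cost of the saved product; how the paper combines this with the NTT is not reproduced here.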