The Number Theoretic Transform (NTT) is a critical computational bottleneck in many lattice-based post-quantum cryptographic (PQC) algorithms. Using the Fast Fourier Transform (FFT) algorithm, the NTT of a polynomial of degree N - 1 can be computed in O(N log N) time. Hardware implementations of the NTT are generally preferred over software implementations, which are significantly slower because of complex memory access patterns and modular arithmetic operations. Achieving maximum throughput in hardware, however, typically demands a prohibitively large number of butterfly unit instantiations. In this work, we propose @NTT, which exploits the fact that the ring parameters in these algorithms are fixed, enabling design-time constant optimization and achieving the maximum throughput of an N-point NTT per clock cycle with a compact hardware footprint. Our case study on the Dilithium NTT, implemented using the TSMC 28 nm library, operates at a clock frequency of 1.0 GHz with an area of 1.45 mm^2. On FPGA, the design achieves a throughput-per-LUT that is 5.2x higher than that of the state-of-the-art implementation.
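The O(N log N) computation and the role of fixed ring parameters can be illustrated with a small software sketch. It is not the @NTT architecture and not taken from the paper: it is a plain cyclic radix-2 NTT in Python, whereas Dilithium specifies a negacyclic transform. The modulus q = 2^23 - 2^13 + 1 and N = 256 are assumed from the public Dilithium specification rather than from the abstract, and all function names are hypothetical. Because q and N are fixed, the root of unity and every twiddle factor are constants that could be computed once ahead of time.

```python
# Assumed parameters (from the public Dilithium specification, not the abstract).
Q = 2**23 - 2**13 + 1  # 8380417
N = 256


def find_root_of_unity(n, q):
    """Return an element of multiplicative order exactly n modulo the prime q.

    Assumes n is a power of two that divides q - 1.
    """
    assert (q - 1) % n == 0
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)
        # For a power-of-two n, w^(n/2) == -1 means the order is exactly n.
        if pow(w, n // 2, q) == q - 1:
            return w
    raise ValueError("no root of unity found")


def bit_reverse_copy(values):
    """Return a copy of values permuted into bit-reversed index order."""
    n = len(values)
    bits = n.bit_length() - 1
    out = [0] * n
    for i, v in enumerate(values):
        r = 0
        for b in range(bits):
            if (i >> b) & 1:
                r |= 1 << (bits - 1 - b)
        out[r] = v
    return out


def ntt(coeffs, omega, q):
    """Cyclic NTT via iterative Cooley-Tukey butterflies, O(N log N)."""
    n = len(coeffs)
    a = bit_reverse_copy(coeffs)
    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(omega, n // length, q)
        # Twiddles depend only on (n, q, omega), so they are constants
        # once the ring parameters are fixed.
        twiddles = [1] * half
        for k in range(1, half):
            twiddles[k] = twiddles[k - 1] * w_len % q
        for start in range(0, n, length):
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * twiddles[k] % q
                a[start + k] = (u + v) % q          # butterfly upper output
                a[start + k + half] = (u - v) % q   # butterfly lower output
        length *= 2
    return a


def naive_ntt(coeffs, omega, q):
    """Reference O(N^2) evaluation of the same transform."""
    n = len(coeffs)
    return [
        sum(c * pow(omega, j * k, q) for j, c in enumerate(coeffs)) % q
        for k in range(n)
    ]
```

Each of the log2(N) stages performs N/2 butterflies, so a fully unrolled design would need (N/2) log2(N) butterfly units, which is 1024 for N = 256. This is the instantiation cost the abstract describes as prohibitive when the butterflies are built as general-purpose units.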