PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption

High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). The Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, PaReNTT, makes four novel contributions. First, parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles needed to process the polynomials. Because the number of clock cycles is inversely proportional to the level of parallelism, this can enable real-time processing for HE applications. Second, the proposed architecture eliminates the need to permute the NTT outputs before their product is input to the iNTT. This reduces latency by n/4 clock cycles, where n is the length of the polynomial, and reduces the buffer requirement by one delay-switch-delay circuit of size n. Third, an approach to selecting special moduli is presented in which the moduli can be expressed in terms of a few signed power-of-two terms. Fourth, novel architectures are proposed for pre-processing, which computes the residual polynomials using the CRT, and for post-processing, which combines the residual polynomials. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long polynomial modular multipliers are well suited to applications that require low latency and a high sample rate, as these feed-forward architectures can be pipelined at arbitrary levels.
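
The data flow summarized above (CRT pre-processing, per-modulus NTT, pointwise product, iNTT, CRT post-processing) can be illustrated with a small software reference model. This is only a sketch under stated assumptions, not the PaReNTT hardware architecture: it assumes multiplication in the negacyclic ring Z_Q[x]/(x^n + 1), which the abstract does not specify; the moduli 97, 193 and 257 and all function names are hypothetical choices made for illustration; and a naive O(n^2) transform stands in for the parallel butterfly structure. It requires Python 3.8 or later for modular inverses via `pow`.

```python
# Illustrative RNS moduli (an assumption, not taken from the paper): primes with
# q = 1 (mod 2n) for n = 8, each written with a few signed power-of-two terms.
MODULI = [2**7 - 2**5 + 1, 2**8 - 2**6 + 1, 2**8 + 1]  # 97, 193, 257


def find_psi(n, p):
    """Return a primitive 2n-th root of unity mod p (n must be a power of two)."""
    for x in range(2, p):
        if pow(x, n, p) == p - 1:  # x^n = -1 implies the order of x is exactly 2n
            return x
    raise ValueError("no primitive 2n-th root of unity; need p = 1 (mod 2n)")


def ntt(a, omega, p):
    """Naive O(n^2) transform: A[k] = sum_j a[j] * omega^(j*k) mod p."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, p) for j in range(n)) % p
            for k in range(n)]


def negacyclic_mul(a, b, p):
    """Multiply a and b in Z_p[x]/(x^n + 1) using NTT, pointwise product, iNTT."""
    n = len(a)
    psi = find_psi(n, p)
    omega = psi * psi % p
    psi_inv, n_inv = pow(psi, -1, p), pow(n, -1, p)
    # Weighting by psi^i turns the negacyclic product into a cyclic one.
    a_w = [a[i] * pow(psi, i, p) % p for i in range(n)]
    b_w = [b[i] * pow(psi, i, p) % p for i in range(n)]
    prod = [x * y % p for x, y in zip(ntt(a_w, omega, p), ntt(b_w, omega, p))]
    c_w = ntt(prod, pow(omega, -1, p), p)  # inverse transform, unscaled
    return [c_w[i] * n_inv * pow(psi_inv, i, p) % p for i in range(n)]


def crt_combine(residues, moduli):
    """Post-processing: merge residual polynomials into coefficients mod Q."""
    Q = 1
    for q in moduli:
        Q *= q
    out = [0] * len(residues[0])
    for r, q in zip(residues, moduli):
        Qi = Q // q
        w = Qi * pow(Qi, -1, q)  # CRT weight: 1 mod q, 0 mod the other moduli
        out = [(o + rk * w) % Q for o, rk in zip(out, r)]
    return out


def rns_poly_mul(a, b, moduli):
    """Pre-process into residues, multiply per modulus, then recombine."""
    residues = [negacyclic_mul([x % q for x in a], [y % q for y in b], q)
                for q in moduli]
    return crt_combine(residues, moduli)


def schoolbook_negacyclic(a, b, Q):
    """Direct O(n^2) product mod (x^n + 1, Q), used only as a cross-check."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:  # x^n = -1, so wrapped terms are subtracted
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % Q
    return c


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
assert rns_poly_mul(a, b, MODULI) == schoolbook_negacyclic(a, b, 97 * 193 * 257)
```

Each residue channel in `rns_poly_mul` is independent of the others, which is what makes the CRT decomposition attractive for parallel hardware; the per-channel transforms here are sequential only because this is a software model.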
