Modular multiplication is a fundamental arithmetic primitive in Residue Number Systems (RNS) and is often the dominant source of delay, area, and energy consumption in RNS datapaths used in cryptography, signal processing, and machine-learning accelerators. Recent work introduced a twit-based residue representation for moduli of the form $2^n \pm \delta$, with $0 \le \delta \le 2^{n-1}-1$, and showed that it enables efficient generic modular addition and subtraction across the full admissible range of $\delta$. However, no efficient modular multiplier compatible with this representation has been available. This paper presents a generic twit-based modulo-$(2^n \pm \delta)$ multiplier for RNS channels. The proposed architecture computes the product through operand splitting, modular partial-product generation, carry-save accumulation, overflow folding, and a twit-compatible final modular addition. By deferring carry propagation to the final stage, the resulting organization avoids the long critical paths characteristic of conventional multiply-then-reduce designs. To demonstrate the effectiveness of the proposed approach, we study a modulus set with 5-bit residue channels and show that, owing to the broad admissible range of $\delta$, it can provide a sufficiently wide dynamic range. Additional 8-bit and 11-bit configurations are used to evaluate the approach at larger channel widths. We implement and synthesize the proposed multiplier in a FreePDK 45\,nm flow; the results show average reductions of 20.5\% in delay, 13.2\% in area, and 28.0\% in power relative to baseline designs. A system-level study further indicates that these circuit-level improvements translate into lower end-to-end latency over a broad range of modular multiplication and addition workloads.
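The arithmetic identity behind overflow folding can be illustrated in software. The sketch below is not the proposed hardware architecture and does not model the twit representation; it only assumes the congruence $2^n \equiv \mp\delta \pmod{2^n \pm \delta}$, splits one operand bit by bit (an illustrative choice of operand splitting), accumulates modular partial products, and folds the overflow at the end. The names `fold_reduce` and `modmul_sketch` and the `sign` parameter are hypothetical.

```python
def fold_reduce(x: int, n: int, delta: int, sign: int) -> int:
    """Reduce x >= 0 modulo m = 2**n + sign*delta by folding bits above position n.

    Assumes 0 <= delta <= 2**(n-1) - 1 and sign in {+1, -1}.
    Uses 2**n ≡ -sign*delta (mod m); no division is performed.
    """
    m = (1 << n) + sign * delta
    mask = (1 << n) - 1
    c = -sign * delta          # 2**n is congruent to c modulo m
    negative = False           # tracks the sign of the value that x represents
    while x >> n:              # some bits at or above position n remain
        hi, lo = x >> n, x & mask
        x = lo + c * hi        # fold the overflow back into the low part
        if x < 0:              # only possible when m = 2**n + delta
            x, negative = -x, not negative
    if x >= m:                 # only possible when m = 2**n - delta
        x -= m
    if negative and x:
        x = m - x
    return x


def modmul_sketch(a: int, b: int, n: int, delta: int, sign: int) -> int:
    """Compute a*b mod (2**n + sign*delta) from modular partial products."""
    acc = 0                    # hardware would keep this sum in carry-save form
    shifted = a                # holds a * 2**i mod m for the current bit i
    i = 0
    while b >> i:
        if (b >> i) & 1:
            acc += shifted     # add the partial product for bit i of b
        shifted = fold_reduce(shifted << 1, n, delta, sign)
        i += 1
    return fold_reduce(acc, n, delta, sign)
```

For example, `modmul_sketch(17, 23, 5, 3, -1)` multiplies two residues modulo $2^5 - 3 = 29$. Carry propagation in this software model is implicit in Python's integer addition, so the sketch shows the reduction logic only, not the delay behaviour discussed above.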