Fast multipliers with large bit widths can occupy significant silicon area, which can be reduced by employing multi-cycle multipliers. This paper introduces architectures and parameterized Verilog circuit generators for 2-cycle integer multipliers. When implementing an algorithm in hardware, it is common that fewer than one multiplication needs to be performed per clock cycle. It is also possible that the required number of multiplications per cycle is fractional, e.g., 3.5. In such a case, we can use 4 multipliers, each with a throughput of 1 result per cycle. However, we can instead use 3 such multipliers plus one multiplier with a throughput of 1/2. Resource sharing allows a multiplier with a lower throughput to be smaller, which yields area savings. The proposed multipliers can be customized with regard to latency and clock frequency. Two main architectures are presented in this work, each with several variants. All proposed designs were automatically synthesized and tested for various bit widths. Our 2-cycle multipliers offer area savings of up to 21%, 42%, 32%, 41%, and 48% for bit widths of 8, 16, 32, 64, and 128, respectively, relative to synthesizing the "*" operator with a throughput of 1. Furthermore, some of the proposed designs also offer power savings under certain conditions.
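The allocation argument above (3.5 multiplications per cycle served by 3 full-throughput multipliers plus one half-throughput multiplier) can be illustrated with a minimal sketch. The function name `allocate_multipliers` and the selection rule are hypothetical and are not taken from the paper; the sketch assumes only two multiplier types are available (throughput 1 and throughput 1/2) and that at most one half-throughput unit is used.

```python
from fractions import Fraction
import math


def allocate_multipliers(required_rate):
    """Return (full, half) multiplier counts for a required rate.

    full: number of multipliers with throughput 1 result per cycle
    half: number of 2-cycle multipliers with throughput 1/2
    Hypothetical helper: assumes at most one half-throughput unit is used.
    """
    rate = Fraction(required_rate)
    if rate < 0:
        raise ValueError("required_rate must be non-negative")

    full = math.floor(rate)      # whole multiplications per cycle
    remainder = rate - full      # fractional part still to be covered

    if remainder == 0:
        half = 0                 # full-throughput units cover the rate exactly
    elif remainder <= Fraction(1, 2):
        half = 1                 # one 2-cycle multiplier covers the remainder
    else:
        full += 1                # remainder exceeds 1/2: round up to a full unit
        half = 0
    return full, half


if __name__ == "__main__":
    print(allocate_multipliers(3.5))
```

For a required rate of 3.5, the sketch selects 3 full-throughput multipliers and one 2-cycle multiplier, matching the example in the text. Whether this mix is actually smaller than 4 full-throughput multipliers depends on the synthesized area of each unit.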