The rapid adoption of low-precision arithmetic in artificial intelligence and edge computing has created a strong demand for energy-efficient and flexible floating-point multiply-accumulate (MAC) units. This paper presents a fully pipelined dual-precision floating-point MAC processing engine supporting FP8 formats (E4M3, E5M2) and FP4 formats (E2M1, E1M2), specifically optimized for low-power and high-throughput AI workloads. The proposed architecture employs a novel bit-partitioning technique that enables a single 4-bit unit multiplier to operate either as a standard 4x4 multiplier for FP8 or as two parallel 2x2 multipliers for 2-bit operands, achieving 100 percent hardware utilization without duplicating logic. Implemented in 28 nm technology, the proposed processing engine achieves an operating frequency of 1.94 GHz with an area of 0.00396 mm^2 and power consumption of 2.13 mW, resulting in up to 60.4 percent area reduction and 86.6 percent power savings compared to state-of-the-art designs.
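The mode-switchable multiplier described above can be illustrated with a small behavioral model. This is only a sketch of the textbook decomposition of a 4x4 product into 2x2 partial products, not the paper's circuit: the function names, the `dual_mode` flag, and the packing of the two 2x2 results into one 8-bit output are assumptions made for illustration. In this naive version the two cross-term products are simply gated off in dual mode, so it does not reproduce the 100 percent utilization the abstract reports.

```python
def mul2x2(a: int, b: int) -> int:
    """Unsigned 2-bit x 2-bit multiply; the result (at most 9) fits in 4 bits."""
    assert 0 <= a < 4 and 0 <= b < 4
    return a * b


def partitioned_mul(a: int, b: int, dual_mode: bool) -> int:
    """Behavioral model of a 4-bit multiplier with a mode switch (hypothetical).

    dual_mode=False: returns the full 8-bit product a * b.
    dual_mode=True:  treats each 4-bit input as two packed 2-bit lanes and
                     returns the two 4-bit lane products packed as (hi << 4) | lo.
    """
    assert 0 <= a < 16 and 0 <= b < 16
    a_hi, a_lo = a >> 2, a & 0b11
    b_hi, b_lo = b >> 2, b & 0b11

    # The two "diagonal" partial products are needed in both modes.
    hh = mul2x2(a_hi, b_hi)
    ll = mul2x2(a_lo, b_lo)

    if dual_mode:
        # The lanes stay independent: no cross terms, and no carry can cross
        # the lane boundary because each product fits in 4 bits.
        return (hh << 4) | ll

    # Single 4x4 mode: add the cross terms at weight 2^2.
    hl = mul2x2(a_hi, b_lo)
    lh = mul2x2(a_lo, b_hi)
    return (hh << 4) + ((hl + lh) << 2) + ll


if __name__ == "__main__":
    print(partitioned_mul(0b1101, 0b1011, dual_mode=False))      # 13 * 11 = 143
    print(hex(partitioned_mul(0b1101, 0b1011, dual_mode=True)))  # lanes 3*2 and 1*3 -> 0x63
```

The identity used in single mode is a * b = (a_hi * b_hi) * 16 + (a_hi * b_lo + a_lo * b_hi) * 4 + a_lo * b_lo, so the only difference between the two modes in this model is whether the cross terms are added.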