The rapid adoption of low-precision arithmetic in artificial intelligence and edge computing has created strong demand for energy-efficient, flexible floating-point multiply-accumulate (MAC) units. This paper presents a dual-precision floating-point MAC processing element (PE) that supports FP8 (E4M3, E5M2) and FP4 (2 x E2M1, 2 x E1M2) formats and is optimized for low-power, high-throughput AI workloads. The proposed architecture employs a novel bit-partitioning technique that lets a single 4-bit unit multiplier operate either as a standard 4 x 4 multiplier for FP8 or as two parallel 2 x 2 multipliers for 2-bit operands, maximizing hardware utilization without duplicating logic. Implemented in 28 nm technology, the proposed PE achieves an operating frequency of 1.94 GHz with an area of 0.00396 mm^2 and a power consumption of 2.13 mW, corresponding to up to 60.4% area reduction and 86.6% power savings compared with state-of-the-art designs. These results make it well suited to energy-constrained AI inference and mixed-precision computing when deployed within larger accelerator architectures.
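The two operating modes of the unit multiplier can be illustrated with a small functional model. The sketch below is an assumption for illustration only, not the circuit described in the paper: the names `mul2x2` and `partitioned_mul` are hypothetical, and the model uses the textbook decomposition of a 4 x 4 product into 2 x 2 partial products. In dual mode the cross terms are ignored, so the high and low operand halves yield two independent 2 x 2 products.

```python
def mul2x2(a: int, b: int) -> int:
    """Unsigned 2-bit x 2-bit multiply; the result (at most 9) fits in 4 bits."""
    assert 0 <= a < 4 and 0 <= b < 4
    return a * b


def partitioned_mul(a: int, b: int, dual: bool) -> int:
    """Behavioral model of a 4-bit multiplier with two modes (hypothetical).

    dual=False: full 4 x 4 product of a and b (8-bit result).
    dual=True:  two independent 2 x 2 products, packed as
                [hi product (4 bits) | lo product (4 bits)].
    """
    assert 0 <= a < 16 and 0 <= b < 16
    a_hi, a_lo = a >> 2, a & 0b11
    b_hi, b_lo = b >> 2, b & 0b11

    hh = mul2x2(a_hi, b_hi)  # used in both modes
    ll = mul2x2(a_lo, b_lo)  # used in both modes

    if dual:
        # Cross terms are gated off; each 4-bit product stays in its own nibble.
        return (hh << 4) | ll

    # Single mode: add the cross terms at weight 2^2 to form the full product.
    hl = mul2x2(a_hi, b_lo)
    lh = mul2x2(a_lo, b_hi)
    return (hh << 4) + ((hl + lh) << 2) + ll


# Example with a = 0b1101 (13) and b = 0b1011 (11)
full = partitioned_mul(13, 11, dual=False)    # 13 * 11
packed = partitioned_mul(13, 11, dual=True)   # (3 * 2) in high nibble, (1 * 3) in low nibble
print(full, packed >> 4, packed & 0xF)
```

This model captures only the arithmetic behavior of the two modes. How the partial products are shared in hardware, and how exponents, signs, and accumulation are handled, is not specified here.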