FP8 low-precision formats have been widely adopted for Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures struggle to support variable FP8 aligned-mantissa bitwidths, because unified alignment strategies and fixed-precision multiply-accumulate (MAC) units cannot adapt to input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) dynamic shift-aware bitwidth prediction (DSBP), which uses on-the-fly input prediction to adaptively adjust the aligned-mantissa precision of weights (2/4/6/8b) and inputs (2$\sim$12b); (2) a FIFO-based input alignment unit (FIAU) that replaces complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array that provides flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64$\times$96 CIM array, the design achieves 20.4 TFLOPS/W for fixed E5M7, 2.8$\times$ higher FP8 efficiency than previous work, while supporting all FP8 formats. Results on Llama-7b show that DSBP achieves higher efficiency than the fixed-bitwidth mode at the same accuracy level on both the BoolQ and Winogrande datasets, and its configurable parameters enable flexible accuracy-efficiency trade-offs.
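To make the notion of an aligned-mantissa bitwidth concrete, the following is a minimal sketch of generic max-exponent alignment followed by an integer MAC. It is not the DSBP, FIAU, or MAC-array design described above; the function names `align_mantissas` and `aligned_dot` are hypothetical, and ordinary Python floats stand in for FP8 operands.

```python
import math

def align_mantissas(values, bits):
    """Align values to their shared maximum exponent, keeping `bits` magnitude bits.

    Returns (ints, scale) so that values[i] is approximated by ints[i] * scale.
    Illustrative assumption: truncation toward zero models the right shift.
    """
    nonzero = [v for v in values if v != 0.0]
    # math.frexp(v) returns (m, e) with v == m * 2**e and 0.5 <= |m| < 1.
    e_max = max(math.frexp(v)[1] for v in nonzero) if nonzero else 0
    # Values with small exponents lose low-order bits (or vanish) here.
    ints = [int(math.ldexp(v, bits - e_max)) for v in values]
    scale = math.ldexp(1.0, e_max - bits)
    return ints, scale

def aligned_dot(inputs, weights, input_bits, weight_bits):
    """Dot product computed as an integer MAC over aligned mantissas."""
    qi, si = align_mantissas(inputs, input_bits)
    qw, sw = align_mantissas(weights, weight_bits)
    acc = sum(a * b for a, b in zip(qi, qw))  # pure integer accumulation
    return acc * si * sw                      # apply both shared scales once

inputs = [1.5, 0.25, -0.75]
weights = [0.5, 2.0, 1.0]
print(aligned_dot(inputs, weights, 4, 4))  # 4-bit aligned inputs
print(aligned_dot(inputs, weights, 2, 4))  # 2-bit aligned inputs
```

With 4 aligned bits the example inputs are represented exactly, whereas with 2 bits the small-magnitude input 0.25 is shifted out entirely. This sensitivity to the data distribution is the kind of effect that motivates choosing the aligned bitwidth adaptively.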