FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures struggle to support variable FP8 aligned-mantissa bitwidths, because unified alignment strategies and fixed-precision multiply-accumulate (MAC) units cannot efficiently handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) scheme with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2$\sim$12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) that replaces complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64$\times$96 CIM array, the design achieves 20.4 TFLOPS/W for fixed E5M7, a 2.8$\times$ improvement in FP8 efficiency over previous work, while supporting all FP8 formats. Results on Llama-7b show that DSBP achieves higher efficiency than the fixed-bitwidth mode at the same accuracy level on both the BoolQ and Winogrande datasets, with configurable parameters enabling flexible accuracy-efficiency trade-offs.
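To make the aligned-mantissa idea concrete, here is a minimal Python sketch (not the paper's hardware or RTL; the function names and the E4M3 decoding are illustrative assumptions) of how FP8 mantissas can be aligned to a shared exponent so that accumulation reduces to integer addition. The key point the abstract relies on is visible here: the aligned-mantissa bitwidth grows with the exponent spread of the group, so inputs with narrow distributions need far fewer bits than a worst-case fixed-precision MAC provides.

```python
def decode_e4m3(byte):
    """Decode an FP8 E4M3 byte into (sign, unbiased exponent, 4b mantissa with hidden bit).

    Illustrative decoder: bias = 7, 3 explicit mantissa bits; subnormals
    (exponent field 0) have no hidden bit and use exponent 1 - bias.
    """
    sign = -1 if (byte >> 7) & 1 else 1
    exp_field = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp_field == 0:                    # subnormal
        return sign, 1 - 7, man
    return sign, exp_field - 7, man | 0x8  # normal: prepend hidden leading 1

def align_mantissas(fp8_bytes):
    """Align a group of FP8 values to the group's minimum exponent.

    Each mantissa is shifted left by its exponent distance from e_min,
    so the exact sum is sum(aligned) * 2**(e_min - 3). The returned
    width is the magnitude bitwidth the integer accumulator must support.
    """
    decoded = [decode_e4m3(b) for b in fp8_bytes]
    e_min = min(e for _, e, _ in decoded)
    aligned = [s * (m << (e - e_min)) for s, e, m in decoded]
    width = max(abs(a).bit_length() for a in aligned)  # plus one sign bit
    return aligned, e_min, width
```

For example, aligning 1.0 (0x38) and 0.5 (0x30) needs only 5 magnitude bits, whereas a group with a wide exponent spread would need many more; a fixed-width datapath must always provision for the worst case, which is the inefficiency DSBP avoids.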