Compared to the first generation of deep neural networks, dominated by regular, compute-intensive kernels such as matrix multiplications (MatMuls) and convolutions, modern decoder-based transformers interleave attention, normalization, and data-dependent control flow. This demands flexible accelerators, a requirement met by scalable, highly energy-efficient shared-L1-memory vector processing element (VPE) clusters. Meanwhile, the ever-growing size and bandwidth needs of state-of-the-art models make reduced-precision formats increasingly attractive. Microscaling (MX) data formats, based on block floating-point (BFP) representations, have emerged as a promising solution to reduce data volumes while preserving accuracy. However, MX semantics are poorly aligned with vector execution: block scaling and multi-step mixed-precision operations break the regularity of vector pipelines, leading to underutilized compute resources and performance degradation. To address these challenges, we propose VMXDOTP, a RISC-V Vector (RVV) 1.0 instruction set architecture (ISA) extension for efficient MX dot product execution, supporting MXFP8 and MXFP4 inputs, FP32 and BF16 accumulation, and software-defined block sizes. A VMXDOTP-enhanced VPE cluster achieves up to 97 % utilization on MX-MatMul. Implemented in 12 nm FinFET, it achieves up to 125 MXFP8-GFLOPS and 250 MXFP4-GFLOPS, with 843/1632 MXFP8/MXFP4-GFLOPS/W at 1 GHz, 0.8 V, and only 7.2 % area overhead. Our design yields up to 7.0x speedup and 4.9x higher energy efficiency with respect to software-emulated MXFP8-MatMul. Compared with prior MX engines, VMXDOTP supports variable block sizes, is up to 1.4x more area-efficient, and delivers up to 2.1x higher energy efficiency.
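The MX semantics described above (per-block power-of-two scales shared by a group of narrow-precision elements) can be sketched in software as follows. This is a minimal illustrative emulation, not the VMXDOTP instruction or the exact MXFP8/MXFP4 encodings; the `quantize_block` helper, the 3-bit mantissa rounding, and the default block size of 32 are assumptions made for the sketch.

```python
import math

def quantize_block(block):
    """Return (shared_exponent, quantized_elements) for one block.

    A power-of-two scale is shared across the block, as in block
    floating-point (BFP). Mantissas are rounded to 3 fractional bits as a
    stand-in for a narrow FP format; real MXFP8/MXFP4 encodings differ.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(amax))  # shared power-of-two scale
    scale = 2.0 ** shared_exp
    q = [round(v / scale * 8) / 8 for v in block]  # 3-bit fractional rounding
    return shared_exp, q

def mx_dot(x, y, block_size=32):
    """Block-scaled dot product: per-block partial sums are accumulated in
    full precision, with the combined block scales applied once per block
    (the multi-step, mixed-precision pattern that VMXDOTP accelerates)."""
    assert len(x) == len(y)
    acc = 0.0
    for i in range(0, len(x), block_size):
        ex, qx = quantize_block(x[i:i + block_size])
        ey, qy = quantize_block(y[i:i + block_size])
        partial = sum(a * b for a, b in zip(qx, qy))
        acc += partial * (2.0 ** (ex + ey))  # apply shared scales per block
    return acc
```

Note that `block_size` is a plain function parameter here, mirroring the software-defined block sizes the extension supports; in a pure-software emulation, the per-block quantize/multiply/rescale steps are exactly what breaks vector-pipeline regularity.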