Dense Matrix Multiplication (MatMul) is arguably one of the most ubiquitous compute-intensive kernels, spanning linear algebra, DSP, graphics, and machine learning applications. MatMul optimization is therefore crucial not only in high-performance processors but also in embedded low-power platforms. Several Instruction Set Architectures (ISAs) have recently included matrix extensions to improve MatMul performance and efficiency, at the cost of added matrix register files and units. In this paper, we propose Matrix eXtension (MX), a lightweight approach that builds upon the open-source RISC-V Vector (RVV) ISA to boost MatMul energy efficiency. Instead of adding expensive dedicated hardware, MX uses the pre-existing vector register file and functional units to create a hybrid vector/matrix engine. The area cost is negligible (< 3%) and comes from a compact near-FPU tile buffer that enables higher data reuse; there is no clock frequency overhead. We implement MX on a compact and highly energy-optimized RVV processor and evaluate it in both a Dual-Core and a 64-Core cluster in a 12-nm technology node. For a double-precision 64x64x64 matrix multiplication, MX boosts the Dual-Core cluster's energy efficiency by 10% at the same FPU utilization (~97%). For the same benchmark on 32-bit data, it boosts the 64-Core cluster's energy efficiency by 25%, with a 56% performance gain.
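
To make the data-reuse argument concrete, the following is a minimal software sketch of output-tile accumulation, one common way a small buffer near the arithmetic units can raise reuse in MatMul. It is not the MX instruction semantics or the authors' kernel. The function name `matmul_tiled`, the tile sizes, and the assumption that the buffer holds a tile of partial sums are all hypothetical choices made for illustration.

```python
def matmul_tiled(a, b, tile_m=4, tile_n=4):
    """Compute C = A x B one (tile_m x tile_n) output tile at a time.

    Illustrative only: `acc` is a hypothetical stand-in for a small
    accumulator buffer; it does not model any real hardware structure.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")

    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            rows = min(tile_m, m - i0)  # handle partial tiles at the edges
            cols = min(tile_n, n - j0)
            # Partial sums for this output tile stay in the small buffer
            # for the entire reduction over the inner dimension.
            acc = [[0.0] * cols for _ in range(rows)]
            for p in range(k):
                for i in range(rows):
                    a_ip = a[i0 + i][p]  # read once, reused `cols` times
                    for j in range(cols):
                        acc[i][j] += a_ip * b[p][j0 + j]
            # Each output element is written back exactly once.
            for i in range(rows):
                c[i0 + i][j0:j0 + cols] = acc[i]
    return c
```

In this sketch, every element of A read for a tile is used `cols` times and every element of B is used `rows` times, while the partial sums never leave `acc` until the tile is complete. Larger tiles therefore mean fewer reads and writes per multiply-accumulate, which is the general effect a tile buffer is meant to provide.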