Efficient mixed-precision matrix multiply-accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, which leads to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the Tensor Core Unit extension of the open-source RISC-V-based Vortex GPGPU. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching the numerical accuracy of NVIDIA Tensor Cores. Ten-Four achieves a 4-cycle operation latency at an Fmax of 262.325 MHz and delivers 134.308 GFLOPS peak throughput per Tensor Core on the AMD Xilinx Alveo U55C FPGA, a ~3.1x performance improvement over an equivalent Berkeley HardFloat-based implementation at less than 60% of the area cost.
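The rounding-error argument above can be illustrated with a small numerical model. The sketch below is not Ten-Four's datapath; it only contrasts a dot product built from discrete units, which round to FP32 after every addition, with an idealized fused unit that accumulates exactly and rounds once. The helper names (`round_fp16`, `round_fp32`, `discrete_dot`, `fused_dot`) and the input values are assumptions chosen for illustration.

```python
import struct
from fractions import Fraction


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def discrete_dot(a, b, acc=0.0):
    """Model of discrete multiply and add units: round to FP32 after each step."""
    for x, y in zip(a, b):
        prod = round_fp32(x * y)       # an FP16 x FP16 product is exact in FP32
        acc = round_fp32(acc + prod)   # every addition introduces a rounding
    return acc


def fused_dot(a, b, acc=0.0):
    """Idealized fused model: exact accumulation, then a single final rounding.

    The exact sum is converted through a double before the FP32 rounding,
    which is sufficient for this illustration.
    """
    exact = Fraction(acc)
    for x, y in zip(a, b):
        exact += Fraction(x) * Fraction(y)
    return round_fp32(float(exact))


# Hypothetical inputs, all exactly representable in FP16.
a = [4096.0, 1.0, 1.0, 1.0, 1.0]
b = [4096.0, 1.0, 1.0, 1.0, 1.0]
assert all(round_fp16(v) == v for v in a + b)

print(discrete_dot(a, b))  # 16777216.0: each +1 is rounded away at 2**24
print(fused_dot(a, b))     # 16777220.0: the exact sum 2**24 + 4 fits in FP32
```

In this model the discrete version loses all four small terms, because 2^24 + 1 lies halfway between two FP32 values and rounds back to 2^24 each time, while the single-rounding version keeps them. Real fused units typically use a finite-width internal accumulator rather than exact arithmetic, so the actual behavior falls somewhere between these two extremes.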