Efficient mixed-precision matrix multiply-accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source Tensor Core implementations rely on discrete arithmetic unit designs, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a configurable mixed-precision fused dot product unit that integrates floating-point and integer arithmetic pipelines within a unified architecture, implemented as part of the Tensor Core Unit extension of the open-source RISC-V-based Vortex GPGPU. Ten-Four supports low-precision multiplication in TF32/FP16/BF16/FP8/BF8/INT8/INT4 with higher-precision FP32/INT32 accumulation, native Microscaling (MX) support, and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core numerical accuracy. Ten-Four achieves 4-cycle latency at 300 MHz Fmax on the Xilinx U55C FPGA, delivering 130.368 GFLOPS peak throughput per Tensor Core and a 2.7x-7.9x speedup over equivalent Berkeley HardFloat and FPnew-based implementations at less than 60% of the area cost. ASIC synthesis in 7 nm FinFET achieves 2.771 TFLOPS/W peak efficiency at 1.58 GHz Fmax.
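The accumulated-rounding-error argument can be illustrated with a small numerical sketch. This is not the Ten-Four datapath: it is an idealised software model, assuming FP16 inputs, an FP32 accumulator, and round-to-nearest-even, and the function names `discrete_dot` and `fused_dot` are hypothetical. It contrasts a chain of separate multiply and add steps, which rounds after every addition, with a fused dot product that sums exactly and rounds once at the end.

```python
from fractions import Fraction

import numpy as np


def discrete_dot(a, b, c):
    """Model of chained discrete units: round to FP32 after every addition."""
    acc = np.float32(c)
    for x, y in zip(a, b):
        # An FP16 x FP16 product fits exactly in FP32 (22 significant bits).
        prod = np.float32(x) * np.float32(y)
        # float32 + float32 stays in float32, so this step rounds.
        acc = np.float32(acc + prod)
    return acc


def fused_dot(a, b, c):
    """Idealised fused unit: exact rational sum, rounded to FP32 at the end.

    Assumption: the final conversion goes through float64, which matches a
    single FP32 rounding for these inputs but can double-round in rare cases.
    """
    exact = Fraction(float(c))
    for x, y in zip(a, b):
        exact += Fraction(float(x)) * Fraction(float(y))
    return np.float32(float(exact))


# Four unit products added to 2**24, where the FP32 spacing is already 2.
a = np.ones(4, dtype=np.float16)
b = np.ones(4, dtype=np.float16)
c = np.float32(2.0**24)

discrete_result = discrete_dot(a, b, c)  # each +1 is a tie and rounds back to 2**24
fused_result = fused_dot(a, b, c)        # 2**24 + 4 is exactly representable in FP32

print(float(discrete_result), float(fused_result))
```

In this contrived case the discrete chain loses all four products, while the single final rounding keeps them. Real workloads show smaller but systematic differences of the same kind.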