As the increasing complexity of neural network (NN) models drives up the demand for computation, AMD has introduced Versal ACAP, a heterogeneous programmable system-on-chip (SoC) architecture that combines programmable logic (PL), CPUs, and dedicated AI engine (AIE) ASICs, with a theoretical throughput of up to 6.4 TFLOPs for FP32, 25.6 TOPs for INT16, and 102.4 TOPs for INT8. However, this added architectural complexity makes it non-trivial to reach the theoretical performance, even for well-studied applications such as matrix-matrix multiplication (MM). In this paper, we present AutoMM, an automatic white-box framework that systematically generates MM accelerator designs on Versal, achieving 3.7 TFLOPs, 7.5 TOPs, and 28.2 TOPs for the FP32, INT16, and INT8 data types, respectively. Our designs are tested on board and achieve energy-efficiency gains of 7.20x (FP32), 3.26x (INT16), and 6.23x (INT8) over the AMD U250 FPGA, 2.32x (FP32) over the Nvidia Jetson TX2 GPU, and 1.06x (FP32) and 1.70x (INT8) over the Nvidia A100 GPU.
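To put the reported throughput in context, the achieved figures can be compared against the theoretical peaks. The following is a minimal sketch that uses only the numbers quoted in the abstract; the names `PEAK`, `ACHIEVED`, and `utilization` are illustrative and are not part of AutoMM.

```python
# Throughput figures quoted in the abstract
# (TFLOPs for FP32, TOPs for the integer types).
PEAK = {"FP32": 6.4, "INT16": 25.6, "INT8": 102.4}
ACHIEVED = {"FP32": 3.7, "INT16": 7.5, "INT8": 28.2}


def utilization(dtype: str) -> float:
    """Return the fraction of theoretical peak reached for a data type."""
    return ACHIEVED[dtype] / PEAK[dtype]


for dtype in PEAK:
    print(f"{dtype}: {utilization(dtype):.1%} of theoretical peak")
```

By this measure the reported designs reach roughly 58% of peak for FP32, 29% for INT16, and 28% for INT8.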