The increasing computational and memory requirements of Deep Learning (DL) workloads have led to outstanding innovations in hardware architectures. An archetype of such architectures is the novel Versal AI Engine (AIE) by AMD/Xilinx. The AIE comprises multiple programmable processors optimized for vector-based algorithms. An AIE array of 400 processor cores operating at 1.25 GHz can deliver a peak throughput of 8 TFLOPs for 32-bit floating-point (fp32) and 128 TOPs for 8-bit integer (int8) precision. In this work, we propose MaxEVA, a novel framework to efficiently map Matrix Multiplication (MatMul) workloads onto Versal AIE devices. Our framework maximizes the performance and energy efficiency of MatMul applications by efficiently exploiting features of the AIE architecture and resolving performance bottlenecks from multiple angles. When demonstrated on the VC1902 device of the VCK190 board, MaxEVA achieves up to 5.44 TFLOPs and 77.01 TOPs throughput for fp32 and int8 precisions, respectively. In terms of energy efficiency, MaxEVA attains up to 85.11 GFLOPs/W for fp32 and 1.73 TOPs/W for int8. Our proposed method substantially outperforms the state-of-the-art approach, exhibiting up to 2.19x higher throughput and 29.4% higher energy efficiency. The MaxEVA framework provides notable insights that help fill the knowledge gap in effectively designing MatMul-based DL workloads on the new Versal AIE devices.
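As a quick sanity check on the figures above, the following minimal sketch relates the stated peak throughput to the core count and clock frequency, and compares the reported MaxEVA results against that peak. The function names are hypothetical and are not part of MaxEVA; the only inputs are the numbers quoted in the abstract. Under these figures, the stated peaks correspond to 16 fp32 operations and 256 int8 operations per core per cycle, and the reported throughput amounts to roughly 68% (fp32) and 60% (int8) of peak.

```python
# Back-of-the-envelope check of the abstract's figures.
# All helper names here are illustrative assumptions, not MaxEVA APIs.

CORES = 400          # AIE cores in the array (from the abstract)
CLOCK_HZ = 1.25e9    # core clock frequency (from the abstract)


def ops_per_cycle_per_core(peak_ops_per_s: float) -> float:
    """Operations each core must complete per cycle to reach the stated peak."""
    return peak_ops_per_s / (CORES * CLOCK_HZ)


def utilization(achieved_ops_per_s: float, peak_ops_per_s: float) -> float:
    """Fraction of the stated peak throughput that was achieved."""
    return achieved_ops_per_s / peak_ops_per_s


# Stated peak and reported throughput, in operations per second.
FP32_PEAK, FP32_ACHIEVED = 8e12, 5.44e12
INT8_PEAK, INT8_ACHIEVED = 128e12, 77.01e12

if __name__ == "__main__":
    for name, peak, achieved in [
        ("fp32", FP32_PEAK, FP32_ACHIEVED),
        ("int8", INT8_PEAK, INT8_ACHIEVED),
    ]:
        per_core = ops_per_cycle_per_core(peak)
        frac = utilization(achieved, peak)
        print(f"{name}: {per_core:.0f} ops/cycle/core at peak, "
              f"{frac:.1%} of peak achieved")
```

The utilization values are simple ratios of the "up to" numbers in the abstract, so they describe the best reported configuration rather than typical behavior.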