This paper investigates the design of parallel general matrix multiplication (GEMM) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to port standard optimization techniques applied in the high-performance realization of GEMM on CPUs to the Versal ACAP. In particular, 1) we address the flexible exploitation of the Versal ACAP multi-level memory hierarchy; 2) we delve into the efficient use of the vector units in the AIE tiles, proposing an architecture-specific micro-kernel for mixed-precision arithmetic to address the strong demand for adaptive-precision inference in deep learning; and 3) we introduce a parallel design for GEMM that spans multiple AIE tiles, enhancing the computational throughput. We conduct an experimental evaluation, with up to 32 AI Engines, that demonstrates the high parallel scalability of the solution.
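To make the blocking strategy referenced above concrete, the following is a minimal sketch of the classical multi-level blocked GEMM that the CPU optimization literature applies and that this work ports to the Versal ACAP: the matrices are partitioned so that each panel maps to one level of the memory hierarchy, and an inner micro-kernel updates a small sub-block of C. This is plain, illustrative Python, not the authors' AIE implementation; the block sizes mc, nc, kc are hypothetical placeholders that would, in practice, be tuned to the capacities of the target memory levels.

```python
def gemm_blocked(A, B, C, mc=4, nc=4, kc=4):
    """C += A @ B with three levels of blocking (row-major lists of lists).

    mc, nc, kc are the block sizes along the m, n, and k dimensions;
    on a real platform each is chosen so a panel fits one memory level.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    for jc in range(0, n, nc):          # column panels of B and C
        for pc in range(0, k, kc):      # panels along the shared k dimension
            for ic in range(0, m, mc):  # row panels of A and C
                # "micro-kernel": update the (ic:ic+mc, jc:jc+nc) block of C
                for i in range(ic, min(ic + mc, m)):
                    for j in range(jc, min(jc + nc, n)):
                        acc = C[i][j]
                        for p in range(pc, min(pc + kc, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

On an architecture with vector units, such as the AIE tiles, the innermost block update would be replaced by a vectorized micro-kernel rather than scalar loops; the loop structure around it is what exploits the memory hierarchy.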