Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

The resurgence of machine learning has increased the demand for high-performance Basic Linear Algebra Subprograms (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, for data (re)organization, and micro kernels that perform the actual computations. Creating high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand-optimization task is further complicated by the recent introduction of matrix engines, such as IBM's POWER10 MMA, Intel AMX, and Arm ME, that deliver high-performance matrix operations. This paper presents a compiler-only alternative to high-performance libraries by incorporating, to the best of our knowledge for the first time, the automatic generation of the layered approach into LLVM, a production compiler. The modular design of the algorithm, in particular the use of LLVM's matrix-multiply intrinsic as a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The use of intrinsics also enables a comprehensive performance study. On processors without hardware matrix engines, the tiling and packing deliver performance up to 22x (Intel), for small matrices, and more than 6x (POWER9), for large matrices, faster than PLuTo, a widely used polyhedral optimizer. The performance also approaches that of high-performance libraries: for large matrices it is only 34% slower than OpenBLAS and on par with Eigen. With MMA in POWER10, this solution is, for large matrices, over 2.6x faster than the vector-extension solution, matches Eigen performance, and achieves up to 96% of BLAS peak performance.
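The layered structure described above (tiling, packing, and a micro kernel behind a narrow interface) can be illustrated with a minimal sketch. This is not the paper's LLVM implementation: the function names, the tile sizes, and the loop order are all assumptions chosen for readability, and the pure-Python micro kernel merely stands in for the role that a matrix-multiply intrinsic would play in generated code.

```python
# Illustrative sketch only. All names and tile sizes below are hypothetical
# and are not taken from the paper or from LLVM.

def pack_a(A, i0, k0, mc, kc):
    """Packing layer: copy an mc x kc tile of A into a contiguous row-major buffer."""
    return [A[i][k] for i in range(i0, i0 + mc) for k in range(k0, k0 + kc)]


def pack_b(B, k0, j0, kc, nc):
    """Packing layer: copy a kc x nc tile of B into a contiguous row-major buffer."""
    return [B[k][j] for k in range(k0, k0 + kc) for j in range(j0, j0 + nc)]


def micro_kernel(a_tile, b_tile, C, i0, j0, mc, nc, kc):
    """Micro kernel: C[i0:i0+mc, j0:j0+nc] += a_tile (mc x kc) * b_tile (kc x nc).

    This is the only place where arithmetic happens, so it is the single
    piece that would be swapped out to target a different accelerator.
    """
    for i in range(mc):
        for j in range(nc):
            acc = 0.0
            for k in range(kc):
                acc += a_tile[i * kc + k] * b_tile[k * nc + j]
            C[i0 + i][j0 + j] += acc


def tiled_matmul(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Tiling layer: walk over tiles, pack them, and call the micro kernel."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, tile_n):
        nc = min(tile_n, n - j0)          # edge tiles may be smaller
        for k0 in range(0, k, tile_k):
            kc = min(tile_k, k - k0)
            b_tile = pack_b(B, k0, j0, kc, nc)   # packed once, reused for all row tiles
            for i0 in range(0, m, tile_m):
                mc = min(tile_m, m - i0)
                a_tile = pack_a(A, i0, k0, mc, kc)
                micro_kernel(a_tile, b_tile, C, i0, j0, mc, nc, kc)
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(A, B))
```

The point of the sketch is the separation of concerns: the tiling and packing functions know nothing about how the product is computed, and the micro kernel knows nothing about the full matrices, which is what makes retargeting a matter of replacing one function.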
