We explore the use of the open-source Apache TVM framework to automatically generate a family of algorithms that follow the approach of popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). In addition, we fully automate the generation process by also leveraging Apache TVM to derive a complete variety of processor-specific micro-kernels for GEMM. This is in contrast with the convention in high-performance libraries, which hand-encode a single micro-kernel per architecture in assembly code. Overall, the combination of our TVM-generated blocked algorithms and micro-kernels for GEMM 1) improves portability and maintainability and streamlines the software life cycle; 2) provides the flexibility to tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par with (or, for specific matrix shapes, even superior to) that of hand-tuned libraries; and 3) features a small memory footprint.
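For readers unfamiliar with the blocked formulation mentioned above, the following is a minimal pure-Python sketch of the loop structure commonly associated with GotoBLAS2-style GEMM: three outer loops that partition the operands into cache-sized blocks, packing of those blocks into contiguous buffers, and two inner loops that invoke a small micro-kernel on each tile of C. It is an illustration only and not the TVM-generated code described here. The function names (`blocked_gemm`, `micro_kernel`) and the block sizes (`mc`, `nc`, `kc`, `mr`, `nr`) are hypothetical choices made for this example.

```python
def micro_kernel(a_pack, b_pack, c, ic, jc, ir, jr, mr_eff, nr_eff):
    """Update one (mr_eff x nr_eff) tile of C with a sequence of rank-1 updates.

    Hypothetical helper: in a tuned library this is the architecture-specific
    routine; here it is plain Python for illustration.
    """
    for p in range(len(b_pack)):          # loop over the kc dimension
        b_row = b_pack[p]
        for i in range(mr_eff):
            a_ip = a_pack[ir + i][p]
            c_row = c[ic + ir + i]
            for j in range(nr_eff):
                c_row[jc + jr + j] += a_ip * b_row[jr + j]


def blocked_gemm(a, b, c, mc=8, nc=16, kc=4, mr=4, nr=4):
    """Compute C += A * B on lists of lists using a five-loop blocked scheme.

    Block sizes are illustrative assumptions, not tuned values.
    """
    m, k, n = len(a), len(b), len(b[0])
    for jc in range(0, n, nc):                        # columns of B and C
        nc_eff = min(nc, n - jc)
        for pc in range(0, k, kc):                    # shared k dimension
            kc_eff = min(kc, k - pc)
            # Pack a (kc x nc) block of B into a contiguous buffer.
            b_pack = [b[pc + p][jc:jc + nc_eff] for p in range(kc_eff)]
            for ic in range(0, m, mc):                # rows of A and C
                mc_eff = min(mc, m - ic)
                # Pack an (mc x kc) block of A into a contiguous buffer.
                a_pack = [a[ic + i][pc:pc + kc_eff] for i in range(mc_eff)]
                for jr in range(0, nc_eff, nr):       # tiles within the block
                    for ir in range(0, mc_eff, mr):
                        micro_kernel(a_pack, b_pack, c, ic, jc, ir, jr,
                                     min(mr, mc_eff - ir),
                                     min(nr, nc_eff - jr))
    return c
```

The `min(...)` expressions handle edge tiles when the matrix dimensions are not multiples of the block sizes. In a tuned implementation the block sizes would be chosen to match the cache hierarchy and register file of the target processor, and the micro-kernel would be the part specialized per architecture.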