The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task, based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS, for GEMM tasks with memory usage under 100 MB.
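The selection step described above can be sketched as follows. This is a hypothetical illustration, not the ADSALA implementation: the training samples, feature choice, and nearest-neighbour lookup are all assumptions standing in for the paper's actual collected data and ML model.

```python
# Hypothetical sketch of ADSALA-style thread-count selection.
# A model trained on timing data maps GEMM shape features to a predicted
# optimal thread count; a tiny nearest-neighbour lookup stands in for
# the real machine learning model here.
import math

# Assumed training data: (log2 of GEMM working-set bytes, best thread count),
# as might be collected by timing runs on the target HPC node.
TRAINED_SAMPLES = [
    (16.0, 2),   # ~64 KiB working set: small GEMM, few threads win
    (20.0, 8),
    (24.0, 16),
    (26.6, 32),  # ~100 MB working set: large GEMM, many threads win
]

def gemm_working_set_bytes(m, n, k, dtype_size=8):
    """Bytes touched by C(m x n) = A(m x k) @ B(k x n) in double precision."""
    return dtype_size * (m * k + k * n + m * n)

def select_threads(m, n, k, max_threads=32):
    """Pick a thread count for a GEMM of shape (m, n, k) by nearest trained
    sample in log-working-set space (stand-in for the trained ML model)."""
    feature = math.log2(gemm_working_set_bytes(m, n, k))
    _, threads = min(TRAINED_SAMPLES, key=lambda s: abs(s[0] - feature))
    return min(threads, max_threads)
```

In a real library the returned value would be passed to the BLAS threading control (e.g. an OpenMP thread-count setting) before dispatching the GEMM call.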