In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in applications such as deep learning and has drawn increasing attention. However, conventional implementations are not well suited to non-regular-shaped matrix-matrix multiplication, and few works focus on optimizing tall-and-skinny matrix-matrix multiplication on CPUs. This paper proposes an auto-tuning framework, AutoTSMM, to build high-performance tall-and-skinny matrix-matrix multiplication. AutoTSMM selects the optimal inner kernels in the install-time stage and generates an execution plan for the pre-pack tall-and-skinny matrix-matrix multiplication in the runtime stage. Experiments demonstrate that AutoTSMM achieves performance competitive with state-of-the-art tall-and-skinny matrix-matrix multiplication implementations and outperforms all conventional matrix-matrix multiplication implementations.
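The two-stage idea in the abstract (pick an inner kernel at install time, then build an execution plan over a pre-packed operand at runtime) can be illustrated with a minimal pure-Python sketch. This is not AutoTSMM's implementation: the names `make_kernel`, `pack_b`, `make_plan`, `run_tsmm`, and `select_kernel` are hypothetical, the "kernels" differ only in a row-block size `mr`, and the choice to pre-pack the small matrix B into columns is an assumption made for illustration.

```python
import time

def make_kernel(mr):
    """Hypothetical inner kernel that multiplies a block of up to `mr` rows of A by packed B."""
    def kernel(a_block, b_cols):
        return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
                for row in a_block]
    kernel.mr = mr  # row-block size this kernel is meant to be called with
    return kernel

def pack_b(b):
    """Pre-pack B (k x n, row-major) into a list of n columns (assumed packing layout)."""
    return [list(col) for col in zip(*b)]

def make_plan(m, mr):
    """Execution plan: half-open row ranges of A, each handled by one kernel call."""
    return [(start, min(start + mr, m)) for start in range(0, m, mr)]

def run_tsmm(a, b_cols, kernel):
    """Runtime stage: execute the plan for a tall A (m x k) against packed B."""
    c = []
    for start, stop in make_plan(len(a), kernel.mr):
        c.extend(kernel(a[start:stop], b_cols))
    return c

def select_kernel(candidates, m=256, k=8, n=8, repeats=3):
    """Install-time stage: time each candidate on synthetic data and keep the fastest."""
    a = [[float(i + j) for j in range(k)] for i in range(m)]
    b_cols = pack_b([[float(i - j) for j in range(n)] for i in range(k)])
    best, best_time = None, float("inf")
    for kernel in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            run_tsmm(a, b_cols, kernel)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best, best_time = kernel, elapsed
    return best

if __name__ == "__main__":
    kernel = select_kernel([make_kernel(mr) for mr in (2, 4, 8)])
    a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # tall-and-skinny A (3 x 2)
    b = [[1.0, 0.0], [0.0, 1.0]]               # small B (2 x 2), packed once
    print(kernel.mr, run_tsmm(a, pack_b(b), kernel))
```

In a real framework the candidates would be hand-written or generated SIMD micro-kernels and the plan would also account for cache blocking and threading; here the timing loop only shows the shape of the selection step, so which `mr` wins is machine-dependent and not meaningful.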