Consecutive matrix multiplications are commonly used in graph neural networks and sparse linear solvers. These operations frequently access the same matrices for both reading and writing. While reusing these matrices improves data locality, it presents a challenge because of the irregular dependencies between iterations of the two multiplication operations. Existing fusion methods often introduce excessive synchronization overhead or redundant overlapped computation, which limits their benefit. This paper proposes tile fusion, a runtime approach that fuses tiles of two matrix-matrix multiplications in which at least one of the involved matrices is sparse. Tile fusion aims to improve data locality while providing sufficient workload for the cores of shared-memory multi-core processors. For a pair of matrix-matrix multiplications, tile fusion outperforms the unfused baseline and MKL implementations with geometric mean speedups of 1.97$\times$ and 1.64$\times$, respectively, on multi-core CPUs.