Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker product of several smaller matrices. Kron-Matmul is a core operation in many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations are built from existing tensor algebra operations, such as matrix multiplication, transpose, and tensor-matrix multiplication. However, this design choice prevents several Kron-Matmul-specific optimizations, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron does not depend on existing linear algebra operations, which enables several new optimizations for Kron-Matmul. As a result, it performs up to 40.7x and 7.85x faster than existing implementations on 1 and 16 GPUs, respectively.
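To make the operation concrete, the following is a minimal pure-Python sketch of Kron-Matmul, written under our own assumptions and not taken from FastKron. It computes the product in two ways: by materializing the full Kronecker product and multiplying once, and by applying one factor at a time with a multiply followed by a transpose, which is the general pattern the existing implementations described above assemble from tensor algebra operations. The function names (`kron`, `matmul`, `kron_matmul_naive`, `kron_matmul_factorwise`) and the list-of-rows matrix representation are hypothetical choices made for illustration only.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [
        [a_val * b_val for a_val in a_row for b_val in b_row]
        for a_row in a
        for b_row in b
    ]


def matmul(x, w):
    """Plain matrix product of two matrices stored as lists of rows."""
    w_cols = list(zip(*w))
    return [
        [sum(xv * wv for xv, wv in zip(row, col)) for col in w_cols]
        for row in x
    ]


def kron_matmul_naive(x, factors):
    """Reference version: build F_1 (x) ... (x) F_N explicitly, then multiply."""
    full = factors[0]
    for f in factors[1:]:
        full = kron(full, f)
    return matmul(x, full)


def kron_matmul_factorwise(x, factors):
    """Apply one factor at a time without forming the Kronecker product.

    For each row of x, and for each factor taken from last to first:
    reshape the row into slices of length P (the factor's row count),
    multiply the slices by the factor, then transpose and flatten.
    After all factors are applied the elements are back in row-major order.
    """
    result = []
    for row in x:
        vec = list(row)
        for f in reversed(factors):
            p, q = len(f), len(f[0])
            slices = [vec[i:i + p] for i in range(0, len(vec), p)]
            prod = matmul(slices, f)
            vec = [prod[r][c] for c in range(q) for r in range(len(prod))]
        result.append(vec)
    return result


# Small integer example: x is 2 x 4, and both factors are 2 x 2.
x = [[1, 2, 3, 4], [5, 6, 7, 8]]
factors = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
assert kron_matmul_naive(x, factors) == kron_matmul_factorwise(x, factors)
```

The factor-at-a-time version never stores the full Kronecker product, whose size grows with the product of the factor dimensions. It does, however, interleave each multiplication with a transpose, which illustrates the kind of decomposition into generic operations that the abstract contrasts with a dedicated Kron-Matmul design.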