Artificial intelligence workloads, especially transformer models, exhibit emergent sparsity, in which computations make selective, sparse accesses to dense data. These workloads run inefficiently on hardware designed for dense computation and do not map well onto sparse data representations. We build a vectorized and parallel matrix-multiplication system, A × B = C, that eliminates unnecessary computations and avoids branches based on a runtime evaluation of sparsity. We combine dynamic code lookup, which adapts to the specific sparsity encoded in the B matrix, with preprocessing of the sparsity maps of the A and B matrices, which computes conditional branches once for the whole computation. Across a wide range of sparsity, from 60% to 95% zeros, our implementation executes fewer instructions and achieves higher performance than Intel MKL's dense or sparse matrix-multiply routines. The benefits reach up to a 2× speedup and a 4× reduction in instruction count.
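To make the two ideas in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a hypothetical group width `GROUP`, a hypothetical kernel table `KERNELS` indexed by a bitmask of nonzero positions in B (standing in for dynamic code lookup), and work lists built once from the sparsity maps of A and B so that the inner loops contain no tests for zero. All names here are illustrative; a real system would dispatch to vectorized machine code rather than Python closures.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Kernel = Callable[[float, List[float], List[float], int], None]

GROUP = 4  # hypothetical number of B columns covered by one kernel


def make_kernel(mask: int) -> Kernel:
    """Build a kernel that touches only the columns whose bit is set in mask."""
    offsets = tuple(bit for bit in range(GROUP) if (mask >> bit) & 1)

    def kernel(a: float, b_row: List[float], c_row: List[float], j0: int) -> None:
        # No zero tests here: the set of columns was fixed when the kernel was built.
        for off in offsets:
            c_row[j0 + off] += a * b_row[j0 + off]

    return kernel


# One kernel per possible sparsity pattern of a group (2**GROUP entries).
KERNELS: List[Kernel] = [make_kernel(mask) for mask in range(1 << GROUP)]


def group_masks(row: List[float]) -> List[int]:
    """Sparsity map of one row: a bitmask of nonzero entries per column group."""
    masks = []
    for j0 in range(0, len(row), GROUP):
        mask = 0
        for bit, value in enumerate(row[j0:j0 + GROUP]):
            if value != 0.0:
                mask |= 1 << bit
        masks.append(mask)
    return masks


def sparse_matmul(A: Matrix, B: Matrix) -> Matrix:
    """Compute C = A x B, skipping zero entries of A and zero groups of B."""
    n, m = len(A), len(B[0])

    # Preprocessing: every sparsity decision is made once, before the main loops.
    b_work: List[List[Tuple[int, Kernel]]] = [
        [(g * GROUP, KERNELS[mask]) for g, mask in enumerate(group_masks(row)) if mask]
        for row in B
    ]
    a_work: List[List[Tuple[int, float]]] = [
        [(k, a) for k, a in enumerate(row) if a != 0.0] for row in A
    ]

    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        c_row = C[i]
        for k, a in a_work[i]:           # only nonzero entries of A
            b_row = B[k]
            for j0, kernel in b_work[k]:  # only nonzero groups of row k of B
                kernel(a, b_row, c_row, j0)
    return C
```

In this sketch the cost of inspecting the sparsity pattern is paid once in `group_masks` and the work-list construction, so the multiply loops perform work only for nonzero operands. Whether this pays off depends on the sparsity level and on how cheap the lookup is relative to the skipped arithmetic.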