Machine Learning (ML) operators are the building blocks used to design ML models for various target applications. GEneral Matrix Multiplication (GEMM) operators form the backbone of ML models and are notorious for being computationally expensive, requiring billions of multiply-and-accumulate operations. Consequently, significant effort has been devoted to studying and optimizing GEMM operators to speed up the execution of ML models, and GPUs and accelerators are widely deployed to accelerate ML workloads by optimizing GEMM execution. Nonetheless, the performance of NonGEMM operators has not been studied as thoroughly as that of GEMMs. Therefore, this paper describes \bench, a benchmark for studying NonGEMM operators. We first construct \bench using popular ML workloads from different domains, then perform case studies on GPU platforms of various grades to analyze the behavior of NonGEMM operators in GPU-accelerated systems. Finally, we present key takeaways to bridge the gap between GEMM and NonGEMM operators and to offer the community potential new optimization directions.