We present PM2Lat, a fast and general framework for accurately predicting the latency of deep neural network (DNN) models on GPUs, with a particular focus on NVIDIA hardware. Unlike prior methods that rely on deep learning models or handcrafted heuristics, PM2Lat leverages the Single-Instruction-Multiple-Thread (SIMT) architecture of GPUs to model the execution time of DNN models. We first perform fine-grained modeling of GPU operations by studying their computational behavior and memory access patterns. From these characteristics, we find that different GPU kernels exhibit significant performance disparities even when they serve the same purpose. The core idea of PM2Lat is therefore to differentiate kernels by their configurations and analyze each accordingly. This kernel-aware modeling enables PM2Lat to achieve consistently low prediction error across diverse data types and hardware platforms. Moreover, PM2Lat generalizes beyond standard matrix multiplication to support complex GPU kernels such as Triton, FlashAttention, and CUTLASS attention kernels. Experimental results show that PM2Lat consistently achieves error rates below 10% across data types and hardware platforms on Transformer models, outperforming the state-of-the-art NeuSight by 10-20% for FP32 and by at least 50% for BF16. When applied to diverse kernels, the error rate remains within 3-8%.