Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

This paper presents an analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix multiplication kernel for fundamental analysis, and NVIDIA's CUTLASS library for comprehensive performance data collection across advanced configurations. Using the NVIDIA RTX 4070 as our experimental platform, we developed a Random Forest-based prediction model with multi-output regression capability. By analyzing both naive tiled matrix multiplication with tile sizes from 1 to 32 and 16,128 CUTLASS GEMM operations across diverse configurations, we identified critical performance patterns related to matrix dimensions, thread block configurations, and memory access patterns. Our framework achieved an R^2 score of 0.98 for runtime prediction (mean error 15.57%) and 0.78 for power prediction (median error 5.42%). The model predicts performance across the range of matrix sizes tested, demonstrating robust scaling behavior. Our results show that optimal tile size selection can improve performance by up to 3.2x while reducing power consumption by 22% compared to baseline configurations. Analysis of shared memory utilization and SM occupancy reveals that a tile size of 16x16 achieves the best balance between parallelism and resource usage. The implementation of our framework, including prediction models and analysis tools, is available as an open-source project, GPPerf [https://github.com/pavlyhalim/GPPerf].
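The reported accuracy combines two kinds of metric: an R^2 score and a percentage error. The following minimal Python sketch shows how a high R^2 can coexist with a double-digit mean percentage error, and why the median percentage error can be much lower than the mean. It assumes the percentage error is the absolute relative error per sample, which the abstract does not state; the function names and the runtime values are made up for illustration and are not taken from the GPPerf data.

```python
from statistics import mean, median


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def percentage_errors(actual, predicted):
    """Absolute relative error of each prediction, in percent.

    Assumption: this is the error definition; the abstract does not say.
    """
    return [abs(a - p) / abs(a) * 100.0 for a, p in zip(actual, predicted)]


# Hypothetical GEMM runtimes in milliseconds, spanning small to large matrices.
actual = [0.1, 0.2, 0.5, 10.0, 50.0, 100.0]
predicted = [0.15, 0.25, 0.45, 10.5, 49.0, 102.0]

errors = percentage_errors(actual, predicted)
r2 = r2_score(actual, predicted)

print(f"R^2:              {r2:.4f}")
print(f"mean error (%):   {mean(errors):.2f}")
print(f"median error (%): {median(errors):.2f}")
```

In this toy data the small runtimes carry large relative errors but contribute almost nothing to the squared-error sums, so R^2 stays close to 1 while the mean percentage error is pulled upward. The median is less sensitive to those few large relative errors.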
