To support Large Language Models (LLMs) efficiently, modern GPGPU architectures have introduced new features and programming paradigms, such as warp specialization. These features enable temporal overlap between producers and consumers, as well as between matrix multiplication and activation function operations, which substantially improves performance. Effective AI infrastructure and computer architecture research therefore requires cycle-accurate simulators that support these new features, together with analytical models that faithfully capture workload characteristics. However, existing academic tools provide limited support for these requirements. Existing cycle-accurate simulators have not incorporated new NVIDIA GPU features, such as the Tensor Memory Accelerator (TMA), in a timely manner, and existing analytical models can misestimate DRAM traffic under certain configurations. In this paper, we build a simulation pipeline that spans FlashAttention-3 kernel instrumentation through cycle-accurate simulation. Validated against an H800 GPU, the simulator achieves a mean absolute percentage error (MAPE) of 5.7% and a maximum absolute percentage error of 12.7%. We also provide a theoretical analysis of FlashAttention-3 and explain why existing analytical models can produce inaccurate traffic estimates.
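
For readers unfamiliar with the two accuracy metrics quoted above, the following minimal Python sketch shows how a mean and a maximum absolute percentage error are conventionally computed from paired simulated and measured values. The function names and the sample numbers are hypothetical and purely illustrative; they are not the paper's data, and the paper's exact evaluation procedure (for example, which quantity is compared and over which configurations) is not specified here.

```python
def absolute_percentage_errors(simulated, measured):
    """Per-sample absolute percentage error, in percent, relative to the measurement."""
    if len(simulated) != len(measured):
        raise ValueError("simulated and measured must have the same length")
    if not measured:
        raise ValueError("at least one sample is required")
    # Multiply before dividing so the illustrative values below stay exact.
    return [100.0 * abs(s - m) / abs(m) for s, m in zip(simulated, measured)]


def mape(simulated, measured):
    """Mean absolute percentage error (MAPE), in percent."""
    errors = absolute_percentage_errors(simulated, measured)
    return sum(errors) / len(errors)


def max_ape(simulated, measured):
    """Maximum absolute percentage error, in percent."""
    return max(absolute_percentage_errors(simulated, measured))


# Hypothetical values (e.g. cycle counts); NOT results from the paper.
measured_example = [100.0, 200.0, 400.0]
simulated_example = [110.0, 190.0, 400.0]

if __name__ == "__main__":
    print(mape(simulated_example, measured_example))
    print(max_ape(simulated_example, measured_example))
```

Reporting both values is a common choice: the mean summarizes typical accuracy, while the maximum bounds the worst case across the evaluated configurations.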