To efficiently support Large Language Models (LLMs), modern GPGPU architectures have introduced new features and programming paradigms, such as warp specialization. These features enable temporal overlap between the producer and the consumer, as well as between matrix multiplication and activation function operations, which substantially improves performance. Effective AI infrastructure and computer architecture research therefore requires cycle-accurate simulators that support these features, together with analytical models that faithfully capture workload characteristics. However, existing academic tools offer limited support for these requirements. Existing cycle-accurate simulators do not incorporate new NVIDIA GPU features, such as the Tensor Memory Accelerator (TMA), in a timely manner, and existing analytical models can misestimate DRAM traffic under certain configurations. In this paper, we build a simulation pipeline that spans FlashAttention-3 kernel instrumentation through cycle-accurate simulation. The simulator achieves a mean absolute percentage error (MAPE) of 5.7\% and a maximum absolute percentage error of 12.7\% against an H800 GPU. We also provide a theoretical analysis of FlashAttention-3 and explain why existing analytical models can produce inaccurate traffic estimates.