High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based Flash Attention and a technique for improving its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique, Sawtooth Wavefront Reordering, that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50\% or greater reduction in L2 misses and up to a 60\% increase in throughput on the GB10.
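The abstract names Sawtooth Wavefront Reordering without defining it, so the following is only a minimal sketch of one plausible instance of this class of technique: it assumes the reordering remaps the linear block-launch index into a zig-zag (boustrophedon) traversal of the tile grid, so that consecutively scheduled blocks touch nearby K/V tiles and reuse them while they are still resident in L2. The names `sawtooth_remap` and `attention_tile_driver` are hypothetical, not the paper's API.

```cuda
// Illustrative sketch only: the paper's actual reordering scheme is not
// specified in the abstract; this assumes a zig-zag remap of tile indices.
#include <cstdio>
#include <cuda_runtime.h>

// Remap a linear block index to (row, col) tile coordinates, reversing the
// column order on odd rows. This avoids the jump back to column 0 at every
// row boundary, keeping successive blocks on adjacent K/V tiles.
__device__ __forceinline__ void sawtooth_remap(int linear_id, int tiles_per_row,
                                               int* row, int* col) {
    *row = linear_id / tiles_per_row;
    int c = linear_id % tiles_per_row;
    *col = (*row & 1) ? (tiles_per_row - 1 - c) : c;
}

__global__ void attention_tile_driver(int tiles_per_row) {
    int row, col;
    sawtooth_remap((int)blockIdx.x, tiles_per_row, &row, &col);
    // A real attention kernel would load the Q/K/V tiles for (row, col) here;
    // we just print the visit order for a few blocks to show the pattern.
    if (blockIdx.x < 8 && threadIdx.x == 0)
        printf("block %d -> tile (%d, %d)\n", (int)blockIdx.x, row, col);
}

int main() {
    attention_tile_driver<<<16, 32>>>(/*tiles_per_row=*/4);
    cudaDeviceSynchronize();
    return 0;
}
```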