Efficient GPU programming is crucial for achieving high performance in deep learning (DL) applications. The performance of a GPU program depends on how data is parallelized across threads and arranged within the memory subsystems. The mapping functions that describe tensors on GPUs are known as \emph{tensor layouts}. Low-level programming frameworks such as CUTLASS and Hidet provide expressive layout abstractions but often require \emph{considerable programming effort} to specify optimal layouts manually. High-level GPU programming languages such as Triton instead rely on compiler heuristics to generate the dataflow, layouts, and pipelining strategies of GPU programs. However, these heuristics for dataflow and pipelining do not generalize to complex operators. To balance expressiveness and programmability, we propose Hexcute, a compiler framework that automates layout synthesis while providing explicit control over dataflow and pipelining. Hexcute formalizes layout synthesis as a constraint programming problem and solves it with a type-inference-based algorithm, enabling systematic exploration of optimal layouts and instructions. Our evaluation shows that Hexcute matches the performance of libraries such as cuBLAS and FlashAttention on GEMM, Attention, and their variants, while reducing code size by 1.27$\times$--7.94$\times$ compared to CUTLASS. For mixed-type mixture-of-experts (MoE) operators, Hexcute achieves an average speedup of 6.46$\times$ over Triton. In end-to-end evaluations with vLLM, Hexcute delivers up to 2.60$\times$ speedup on DeepSeek-R1-AWQ and 2.04$\times$ on a Mamba-based model.