The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix Multiplications (GEMMs). Although these kernels are highly optimized, their performance remains sensitive to a large space of runtime parameters, such as tile sizes and pipeline stages. The interaction between these parameters and hardware resources produces a non-convex optimization landscape. Existing approaches to parameter configuration (search-based auto-tuning, heuristic rules, and learned cost models) face a fundamental trade-off between performance optimality and runtime efficiency. In this paper, we present WaveTune, a wave-aware framework for runtime kernel auto-tuning. First, we introduce a unified mapping method to handle input diversity and decompose the configuration space to manage its high dimensionality. Second, we develop an analytical wave-aware bilinear model that accurately predicts kernel latency. Third, we design a sparse sampling scheme based on wave structures and a lightweight dual-table retrieval mechanism to minimize runtime overhead. Together, these components enable precise and efficient runtime configuration for GPU kernels. Across three representative kernels and five GPU architectures, WaveTune consistently achieves near-optimal kernel performance, delivering up to 1.83x kernel-level speedup and up to 1.33x reduction in end-to-end time-to-first-token (TTFT), while reducing runtime decision overhead by five orders of magnitude compared to exhaustive search. These results demonstrate that WaveTune effectively eliminates the traditional trade-off between configuration latency and execution optimality, providing a practical and robust solution for high-performance LLM inference.
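To make the notion of a "wave" concrete, the following is a minimal sketch of the standard wave-quantization arithmetic for a tiled GEMM. It is not WaveTune's bilinear latency model, which the abstract does not specify. The function names, the `blocks_per_sm` parameter, and the example SM count of 108 are illustrative assumptions.

```python
import math

def num_waves(m, n, tile_m, tile_n, num_sms, blocks_per_sm=1):
    """Waves needed to run all output tiles of an (m x n) GEMM.

    Assumes one thread block per output tile and that the GPU runs
    num_sms * blocks_per_sm thread blocks concurrently (hypothetical
    occupancy; real values depend on registers and shared memory).
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    slots = num_sms * blocks_per_sm
    return math.ceil(tiles / slots)

def wave_efficiency(m, n, tile_m, tile_n, num_sms, blocks_per_sm=1):
    """Fraction of block slots doing useful work across all waves."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    slots = num_sms * blocks_per_sm
    waves = math.ceil(tiles / slots)
    return tiles / (waves * slots)

if __name__ == "__main__":
    # Compare a few candidate tile shapes on a hypothetical 108-SM GPU.
    for tile_m, tile_n in [(128, 128), (256, 128), (64, 64)]:
        w = num_waves(4096, 4096, tile_m, tile_n, num_sms=108)
        e = wave_efficiency(4096, 4096, tile_m, tile_n, num_sms=108)
        print(f"tile {tile_m}x{tile_n}: waves={w}, efficiency={e:.3f}")
```

Because the wave count is a ceiling function of the tile count, latency under this simplified view changes in steps rather than smoothly as the problem size or tile shape varies, which is one source of the non-convex landscape described above.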