GPU kernel performance is a key bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels partially alleviate this problem, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent work has applied LLM agents to kernel generation and optimization, but many approaches rely on direct code rewriting, in which parameter choices are implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and the template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and uses profiling feedback to run a constrained parameter search under hardware resource limits. Compared with agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and the path to high-performance configurations more systematic. Experiments on a set of real-world kernels show speedups exceeding 3x in the best case. The method extends naturally to OpenCL, HIP, and other backends, delivering automated performance optimization for real production workloads.
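The core loop described above can be sketched as a constrained search over the parameters a templated kernel exposes. The following is a minimal illustration, not the paper's implementation: the parameter names, resource limits, and the cost model standing in for on-device timing are all assumptions for demonstration purposes.

```python
# Sketch of search-based autotuning over an explicitly parameterized
# kernel template, with pruning under assumed hardware resource limits.
# A real tuner would instantiate the CUDA template for each candidate
# and benchmark it on-device; here a toy cost model replaces timing.
from itertools import product

# Hypothetical template parameters for a tiled kernel.
SEARCH_SPACE = {
    "block_size": [64, 128, 256, 512, 1024],
    "items_per_thread": [1, 2, 4, 8],
    "unroll": [1, 2, 4],
}

MAX_THREADS_PER_BLOCK = 1024   # common CUDA per-block thread limit
MAX_REGISTERS_PER_THREAD = 64  # assumed budget from profiling feedback

def satisfies_limits(cfg):
    """Prune configurations that exceed the (assumed) resource limits."""
    if cfg["block_size"] > MAX_THREADS_PER_BLOCK:
        return False
    # Toy register-pressure estimate: deeper unrolling costs registers.
    est_registers = 16 + 4 * cfg["items_per_thread"] * cfg["unroll"]
    return est_registers <= MAX_REGISTERS_PER_THREAD

def measure(cfg):
    """Stand-in for compiling and timing the instantiated template."""
    work = cfg["block_size"] * cfg["items_per_thread"]
    # Toy model: favor work per block near 1024 and deeper unrolling.
    return abs(1024 - work) + 10.0 / cfg["unroll"]

def autotune():
    keys = list(SEARCH_SPACE)
    candidates = [dict(zip(keys, vals))
                  for vals in product(*SEARCH_SPACE.values())]
    feasible = [c for c in candidates if satisfies_limits(c)]
    return min(feasible, key=measure)

best = autotune()
print(best)
```

In the agentic setting, the LLM agent is responsible for the refactoring step that makes `SEARCH_SPACE` explicit, while profiling feedback tightens the feasibility constraints between iterations; the search itself stays deterministic, which is what reduces run-to-run randomness relative to direct rewriting.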