Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance profiling outputs. Most existing LLM-based approaches to kernel generation rely on simple prompting and feedback loops, incorporating hardware awareness only indirectly through profiling feedback. We introduce KernelFoundry, an evolutionary framework that efficiently explores the GPU kernel design space through three key mechanisms: (1) MAP-Elites quality-diversity search with kernel-specific behavioral dimensions to sustain exploration across diverse optimization strategies; (2) meta-prompt evolution, which co-evolves prompts with kernels to uncover task-specific optimization strategies; and (3) template-based parameter optimization to tune kernels to specific inputs and hardware. We evaluate this framework on KernelBench, robust-kbench, and custom tasks, generating SYCL kernels as a cross-platform GPU programming model and CUDA kernels for comparison with prior work. Our approach consistently outperforms the baseline methods, achieving an average speedup of 2.3x on KernelBench for SYCL. Moreover, KernelFoundry is implemented as a distributed framework with remote access to diverse hardware, enabling rapid benchmarking, and features a flexible user input layer that supports kernel generation for a wide range of real-world use cases beyond benchmarking.
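To make the MAP-Elites mechanism concrete, here is a minimal sketch of a quality-diversity archive, assuming two hypothetical kernel-specific behavioral dimensions (tiling scheme and memory strategy); this is an illustration of the general MAP-Elites idea, not KernelFoundry's actual implementation or descriptor set.

```python
# Minimal MAP-Elites archive sketch (illustrative only; the behavioral
# dimensions "tiling" and "memory" are assumed examples, not the paper's).
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str      # kernel source code (e.g., SYCL or CUDA text)
    speedup: float   # fitness: measured speedup over a reference kernel
    tiling: str      # behavioral dim 1, e.g. "none", "1d", "2d"
    memory: str      # behavioral dim 2, e.g. "global", "shared"

class MapElitesArchive:
    """Keeps the best candidate (elite) per behavioral bin, so search
    pressure is maintained across diverse optimization strategies rather
    than collapsing onto a single best kernel."""

    def __init__(self):
        self.bins = {}  # (tiling, memory) -> current elite Candidate

    def insert(self, cand: Candidate) -> bool:
        """Store cand if its bin is empty or it beats the bin's elite."""
        key = (cand.tiling, cand.memory)
        elite = self.bins.get(key)
        if elite is None or cand.speedup > elite.speedup:
            self.bins[key] = cand
            return True
        return False

    def elites(self):
        """All current elites; these would seed the next round of
        LLM-driven mutations in an evolutionary loop."""
        return list(self.bins.values())
```

In such a loop, each LLM-generated kernel is benchmarked, mapped to a bin by its behavioral descriptors, and inserted; sampling parents from `elites()` rather than only the global best is what sustains exploration of qualitatively different optimization strategies.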