As high-performance computing and AI workloads become increasingly dependent on GPUs, maintaining high performance across rapidly evolving hardware generations has become a major challenge. Developers often spend months tuning scientific applications to fully exploit new architectures, navigating a complex optimization space that spans algorithm design, source implementation, compiler flags and pass sequences, and kernel launch parameters. Existing approaches can effectively search parts of this space in isolation, such as launch configurations or compiler settings, but optimizing across the full space still requires substantial human expertise and iterative manual effort. In this paper, we present Record-Remix-Replay (R^3), a hierarchical optimization framework that combines LLM-driven evolutionary search, Bayesian optimization, and record-replay compilation techniques to efficiently explore GPU kernel optimizations from source-level implementation choices down to compiler pass ordering and runtime configuration. By making candidate evaluation fast and scalable, our approach enables practical end-to-end search over optimization dimensions that are typically treated separately. We show that R^3 optimizes full scientific applications more effectively than traditional approaches that search over kernel parameters and compiler flags, while also running nearly an order of magnitude faster than modern evolutionary search approaches.