ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below that of hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling. Existing agents, however, rely on sparse pass/fail feedback, which leaves them unable to diagnose global constraint violations. We present Argus, an agentic framework that addresses this gap through data-flow invariants: compile-time specifications encoding how data must be choreographed throughout kernel execution. Argus introduces a tile-based, Pythonic DSL that exposes hardware instructions and compiler policies while hiding low-level representations. The DSL provides tag functions, which propagate symbolic annotations through data and control flow, and tag assertions, which enforce relational constraints at use sites. When a violation occurs, the compiler returns a concrete counterexample identifying the thread, data element, and program point, giving the agent dense, structured feedback for targeted fixes. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving, with zero runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques. We evaluate Argus on the AMD MI300X GPU across GEMM, flash attention, and MoE kernels, which together account for over 90% of GPU time in LLM inference. Generated kernels achieve 99-104% of the throughput of state-of-the-art hand-optimized assembly and are 2-1543x faster than kernels from existing agentic systems. Argus further generalizes to 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems.
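To make the idea of tag functions and tag assertions concrete, the following is a minimal toy sketch in plain Python. It is not the Argus DSL or its compiler: the names `row_major_layout`, `transpose`, `assert_tags`, and `Counterexample` are hypothetical, and the check enumerates every (thread, slot) pair by brute force, where the abstract describes abstract interpretation over a layout algebra combined with SMT solving. It only illustrates how a symbolic tag (here, the logical tile coordinate each thread holds) can be propagated through a transformation and checked at a use site, with a concrete counterexample returned on failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Coord = Tuple[int, int]               # (row, col) of a logical tile element
Layout = Callable[[int, int], Coord]  # (thread, slot) -> element tag it holds


@dataclass(frozen=True)
class Counterexample:
    """Hypothetical diagnostic: where the invariant failed and for which data."""
    program_point: str
    thread: int
    slot: int
    found: Coord
    expected: Coord


def row_major_layout(cols: int, slots: int) -> Layout:
    """Each thread holds `slots` consecutive elements of a row-major tile."""
    def layout(thread: int, slot: int) -> Coord:
        linear = thread * slots + slot
        return divmod(linear, cols)   # (row, col)
    return layout


def transpose(layout: Layout) -> Layout:
    """Toy tag function: propagate tags through a transposing staging step."""
    return lambda thread, slot: layout(thread, slot)[::-1]


def assert_tags(program_point: str, actual: Layout, expected: Layout,
                threads: int, slots: int) -> Optional[Counterexample]:
    """Toy tag assertion: brute-force check standing in for a solver query."""
    for thread in range(threads):
        for slot in range(slots):
            found, want = actual(thread, slot), expected(thread, slot)
            if found != want:
                return Counterexample(program_point, thread, slot, found, want)
    return None  # invariant holds for every thread and slot


# Assumed consumer requirement: thread t must hold column t of a 4x4 tile.
expected = lambda thread, slot: (slot, thread)
staged = row_major_layout(cols=4, slots=4)

bad = assert_tags("mma_operand_b", staged, expected, threads=4, slots=4)
good = assert_tags("mma_operand_b", transpose(staged), expected, threads=4, slots=4)
print(bad)
print(good)
```

In this sketch the first check fails at thread 0, slot 1, which holds element (0, 1) where (1, 0) is required, while the check on the transposed layout returns `None`. A diagnostic of this shape is far more actionable than a bare pass/fail signal, which is the contrast the abstract draws.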
