From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, Andre Rosti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang

General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.

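To make the phrase "high-level structured control flow" more concrete, the following is a minimal sketch of a tiled matrix-multiplication loop nest, the kind of structure a spatial compiler could take as a starting point. It is plain Python, not MLIR-AIR or AIR dialect code, and it does not reproduce the authors' implementation. The function name `tiled_matmul` and the tile-size parameters `tile_m`, `tile_n`, and `tile_k` are hypothetical and chosen only for illustration.

```python
def tiled_matmul(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Multiply two matrices (lists of rows) using a tiled loop nest.

    Illustrative only: the outer three loops walk over tiles, which is the
    level at which a spatial compiler might distribute work across compute
    regions; the inner three loops compute within a single tile.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")

    c = [[0] * n for _ in range(m)]

    # Outer loops: iterate over tile origins in the M, N, and K dimensions.
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, k, tile_k):
                # Inner loops: compute one tile's partial products.
                # min(...) handles edge tiles when sizes do not divide evenly.
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(a, b))
```

In this sketch the tile loops run sequentially. On a spatial architecture, each tile iteration is a candidate for placement on a separate compute region, with the data movement for the next tile overlapped with the current tile's computation; how MLIR-AIR performs that mapping is described in the paper itself, not here.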