We present the design and implementation of PolyBlocks, a modular and reusable MLIR-based compiler infrastructure for AI programming frameworks and AI chips. PolyBlocks is built from pass pipelines that compose transformations on loop nests and SSA form, relying primarily on lightweight affine access analysis; the transformations are stitched together in specialized ways, guided by analytical cost models and heuristics, to generate high-performance code automatically. The optimizations in these passes include multi-level tiling, fusion, on-chip scratchpad usage, mapping matmuls and convolutions to matrix units, fusing the attention layer, and several other transformations for parallelism and locality. They have been developed so that PolyBlocks-based compilers for new chips are easy to build and can reuse much of the infrastructure. PolyBlocks' design and architecture enable fully automatic code generation from high-level frameworks down to low-level target-specific intrinsics. Experimental results from evaluating PolyBlocks-powered just-in-time compilation for PyTorch and JAX targeting NVIDIA GPUs show that it matches or outperforms Torch Inductor and XLA in several cases, even though the latter rely on a combination of vendor libraries and code generation. For individual operators such as matmuls and convolutions, PolyBlocks-generated code is competitive with the best vendor-tuned libraries and hand-written kernels.