In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built on the MLIR framework, our compiler applies a structured sequence of passes that exploit NPU architectural features to accelerate AI workloads. By automating compilation from kernel to binary, it enables faster deployment of new Triton kernels, whether hand-written or extracted as subgraphs from PyTorch 2.0, on our target. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by giving developers an open-source, MLIR-based compilation stack and a more flexible path for advancing AI compilation capabilities. Hexagon-MLIR is a work in progress, and we continue to add optimizations and capabilities.