The performance of deep learning models critically depends on efficient kernel implementations, yet developing high-performance kernels for specialized accelerators remains time-consuming and expertise-intensive. While recent work demonstrates that large language models (LLMs) can generate correct and performant GPU kernels, kernel generation for neural processing units (NPUs) remains largely underexplored due to domain-specific programming models, limited public examples, and sparse documentation. Consequently, directly generating AscendC kernels with LLMs yields extremely low correctness, highlighting a substantial gap between GPU and NPU kernel generation. We present AscendCraft, a DSL-guided approach for automatic AscendC kernel generation. AscendCraft introduces a lightweight DSL that abstracts away non-essential complexity while explicitly modeling Ascend-specific execution semantics. Kernels are first generated in the DSL using category-specific expert examples and then transcompiled into AscendC through structured, constraint-driven LLM lowering passes. Evaluated on MultiKernelBench across seven operator categories, AscendCraft achieves 98.1% compilation success and 90.4% functional correctness. Moreover, 46.2% of the generated kernels match or exceed PyTorch eager-execution performance, demonstrating that DSL-guided transcompilation can enable LLMs to generate both correct and competitive NPU kernels. Beyond benchmarks, AscendCraft further demonstrates its generality by successfully generating two correct kernels for the newly proposed mHC architecture, achieving performance that substantially surpasses PyTorch eager execution.