HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs. The source code is available at https://gitcode.com/mindspore/hyper-parallel/tree/master/hyper_parallel/core/multicore
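The idea of statically scheduling tile-level tasks across the AIC and AIV queues can be illustrated with a toy model. The sketch below is not the HyperParallel-MoE implementation: the names `TileTask`, `build_tasks`, and `static_schedule` are hypothetical, the per-tile costs are invented time units rather than measurements, and a simple in-order list scheduler stands in for whatever algorithm the framework actually uses. It assumes each queue runs its tasks serially in the listed order, and that a dependency crossing from one queue to the other needs an explicit event wait.

```python
from dataclasses import dataclass, field


@dataclass
class TileTask:
    name: str
    queue: str   # "AIC" (matrix work) or "AIV" (vector work and communication)
    cost: int    # hypothetical duration in arbitrary time units
    deps: list = field(default_factory=list)


def build_tasks(num_tiles, dispatch_cost=2, gemm_cost=3, combine_cost=2):
    """Build a stage-major task list: all dispatches, then GEMMs, then combines.

    Costs are illustrative assumptions, not numbers from the paper.
    """
    tasks = []
    for i in range(num_tiles):
        tasks.append(TileTask(f"dispatch{i}", "AIV", dispatch_cost))
    for i in range(num_tiles):
        tasks.append(TileTask(f"gemm{i}", "AIC", gemm_cost, [f"dispatch{i}"]))
    for i in range(num_tiles):
        tasks.append(TileTask(f"combine{i}", "AIV", combine_cost, [f"gemm{i}"]))
    return tasks


def static_schedule(tasks):
    """Assign start times offline; tasks must be listed in dependency order.

    Returns (plan, makespan), where each plan entry is
    (name, queue, start, finish, cross_queue_waits).
    """
    by_name = {t.name: t for t in tasks}
    finish = {}
    queue_free = {"AIC": 0, "AIV": 0}
    plan = []
    for t in tasks:
        # A task starts once its dependencies are done and its queue is free.
        ready = max((finish[d] for d in t.deps), default=0)
        start = max(ready, queue_free[t.queue])
        finish[t.name] = start + t.cost
        queue_free[t.queue] = finish[t.name]
        # Only cross-queue dependencies need an event wait; same-queue
        # dependencies are already ordered by in-order execution.
        waits = [d for d in t.deps if by_name[d].queue != t.queue]
        plan.append((t.name, t.queue, start, finish[t.name], waits))
    return plan, max(finish.values())


if __name__ == "__main__":
    tasks = build_tasks(2)
    plan, makespan = static_schedule(tasks)
    serialized = sum(t.cost for t in tasks)
    for name, queue, start, end, waits in plan:
        print(f"{queue} {name}: [{start}, {end}) waits on {waits}")
    print(f"interleaved makespan={makespan}, serialized={serialized}")
```

With two tiles and these made-up costs, the interleaved plan finishes in 10 time units, against 14 when every task runs back to back. The gap comes only from letting the matrix queue work on one tile while the vector queue handles another, which is the kind of overlap the abstract describes.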
