Mixture-of-Experts (MoE) architectures power the majority of frontier large language models, but their inference is bottlenecked by irregular memory access patterns and expert routing overhead. Existing optimized MoE kernels (Megablocks, Tutel, FasterMoE) are implemented in CUDA and locked to NVIDIA hardware. We present TritonMoE, a fused MoE dispatch kernel written entirely in OpenAI Triton that performs the complete forward pass (router scoring, token permutation, expert GEMMs, and weighted output combination) using only portable Triton primitives. Our key optimization is a fused gate+up GEMM kernel that computes both SwiGLU projections from shared L2-cached input tiles with in-register SiLU activation, eliminating 35% of global memory traffic. On an NVIDIA A100, TritonMoE achieves 89% to 131% of the throughput of the CUDA-optimized Megablocks at inference batch sizes (at most 512 tokens) across Mixtral-8x7B, DeepSeek-V3, and Qwen2-MoE configurations. All 162 correctness tests pass on both NVIDIA A100 and AMD MI300X with zero code changes, validating cross-platform portability. We additionally characterize sensitivity to routing imbalance under Zipfian-skewed expert assignments and identify the regime (64 or more experts under extreme skew) where our fixed-tile scheduling underperforms Megablocks' block-sparse layout, motivating dynamic block-to-expert assignment as future work. Code is available at https://github.com/bassrehab/triton-kernels.
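
To make the four stages of the forward pass concrete, the following is a minimal NumPy reference sketch, not the TritonMoE kernel itself. The function names (`silu`, `moe_forward_reference`), the tensor layouts, and the choice of a softmax over only the selected top-k logits are illustrative assumptions rather than details taken from the paper. Concatenating the gate and up weights into a single GEMM mirrors the fused gate+up idea only at the level of the arithmetic; it says nothing about the L2-cache or in-register behavior of the real kernel.

```python
import numpy as np


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def moe_forward_reference(x, w_router, w_gate, w_up, w_down, top_k=2):
    """Dense CPU reference for an MoE layer with SwiGLU experts.

    Assumed (hypothetical) layouts:
      x        (tokens, d_model)
      w_router (d_model, experts)
      w_gate   (experts, d_model, d_ff)
      w_up     (experts, d_model, d_ff)
      w_down   (experts, d_ff, d_model)
    """
    num_experts = w_router.shape[1]
    d_ff = w_gate.shape[2]

    # 1. Router scoring: choose top_k experts per token, then softmax over
    #    the selected logits only (an assumption; models differ here).
    logits = x @ w_router
    top_idx = np.argsort(-logits, axis=1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # 2. Token permutation: sort (token, slot) pairs by expert id so that
    #    each expert reads one contiguous block of rows.
    flat_expert = top_idx.reshape(-1)
    order = np.argsort(flat_expert, kind="stable")
    token_of_row = order // top_k
    x_perm = x[token_of_row]
    counts = np.bincount(flat_expert, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))

    # 3. Expert GEMMs: one GEMM against [gate | up] per expert, then
    #    SiLU(gate) * up, then the down projection.
    y_perm = np.zeros_like(x_perm)
    for e in range(num_experts):
        lo, hi = offsets[e], offsets[e + 1]
        if lo == hi:
            continue  # no tokens were routed to this expert
        w_gate_up = np.concatenate([w_gate[e], w_up[e]], axis=1)
        gate_up = x_perm[lo:hi] @ w_gate_up
        hidden = silu(gate_up[:, :d_ff]) * gate_up[:, d_ff:]
        y_perm[lo:hi] = hidden @ w_down[e]

    # 4. Weighted combination: scatter rows back to their source tokens,
    #    scaled by the routing weights, accumulating over the top_k slots.
    out = np.zeros_like(x)
    row_weights = weights.reshape(-1)[order][:, None]
    np.add.at(out, token_of_row, row_weights * y_perm)
    return out
```

A reference of this kind is mainly useful as an oracle for correctness tests: a fused kernel's output can be compared against it within a floating-point tolerance on small random inputs.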