Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space, reducing routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Validated on seven zero-shot benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% average accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be both fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
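The O(N) to O(sqrt(N)) reduction comes from factoring the expert index space into a Cartesian product of two smaller key tables, so each query scores two tables of size sqrt(N) instead of all N experts. The exact router in OmniMoE is not specified in this abstract; the sketch below illustrates the general product-key routing idea under the assumption that the two sub-scores combine additively (function and variable names here are illustrative, not from the paper).

```python
import numpy as np

def product_key_route(q, K1, K2, top_k):
    """Select top_k of N = n * n experts by scoring two sub-key
    tables of size n each, rather than a single table of size N."""
    # Split the query into two halves, one per sub-key table.
    d = q.shape[0] // 2
    q1, q2 = q[:d], q[d:]

    s1 = K1 @ q1  # (n,) scores against the first sub-key table
    s2 = K2 @ q2  # (n,) scores against the second sub-key table

    # Top-k in each O(sqrt(N)) table.
    i1 = np.argsort(s1)[-top_k:]
    i2 = np.argsort(s2)[-top_k:]

    # Combine the k x k candidate pairs; pair (i, j) maps to the
    # flat expert index i * n + j in the N = n * n expert grid.
    n = K2.shape[0]
    cand = s1[i1][:, None] + s2[i2][None, :]
    flat = cand.ravel()
    best = np.argsort(flat)[-top_k:]
    rows, cols = np.unravel_index(best, (top_k, top_k))
    expert_ids = i1[rows] * n + i2[cols]
    return expert_ids, flat[best]
```

Because the combined score is additive, the globally best pair is always formed from the per-table maxima, so the top-1 selection is exact while only 2*sqrt(N) scores are ever computed.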