Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing, in which exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. Its routing mechanism varies the number of active experts per token according to input complexity. In parallel, the framework implements six scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10 (image classification) and on Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves better parameter efficiency than static baselines. Our key finding is that the optimal expert schedule depends on both task and scale. On image classification, descending schedules, which concentrate capacity in early layers, outperform uniform baselines. For language modeling, the optimal schedule varies with model size: descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, which improves convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks and provides principled guidance for MoE architecture design.
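The two mechanisms named in the abstract, layer-wise expert schedules and a per-token variable number of active experts, can be illustrated with a minimal sketch. The abstract does not specify either rule, so everything below is an assumption made for illustration: the function names `layer_weights`, `expert_schedule`, and `dynamic_route` are hypothetical, the weight shapes chosen for each schedule are one plausible reading of the pattern names, only the schedules named in the abstract are covered, and the cumulative-probability (top-p style) stopping rule is a stand-in for whatever routing criterion DynaMoE actually uses.

```python
import math


def layer_weights(num_layers, pattern):
    """Relative capacity per layer for a named schedule (shapes are assumed)."""
    n = num_layers
    if pattern == "uniform":
        return [1.0] * n
    if pattern == "descending":   # most capacity in early layers
        return [float(n - i) for i in range(n)]
    if pattern == "ascending":    # most capacity in late layers
        return [float(i + 1) for i in range(n)]
    if pattern == "pyramid":      # assumed: capacity peaks in the middle layers
        return [float(min(i, n - 1 - i) + 1) for i in range(n)]
    if pattern == "wave":         # assumed: alternating high/low capacity
        return [2.0 if i % 2 == 0 else 1.0 for i in range(n)]
    raise ValueError(f"unknown pattern: {pattern}")


def expert_schedule(num_layers, total_experts, pattern):
    """Split a fixed expert budget across layers, at least one expert each."""
    if total_experts < num_layers:
        raise ValueError("need at least one expert per layer")
    weights = layer_weights(num_layers, pattern)
    spare = total_experts - num_layers          # budget beyond the guaranteed one
    shares = [w * spare / sum(weights) for w in weights]
    counts = [1 + int(s) for s in shares]       # floor of each share
    # Largest-remainder rounding so the counts sum exactly to the budget.
    leftover = total_experts - sum(counts)
    by_fraction = sorted(range(num_layers),
                         key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def dynamic_route(logits, mass=0.8, k_max=None):
    """Pick the fewest top experts whose router probabilities reach `mass`.

    Returns {expert_index: gate_weight}, with gate weights renormalized to 1.
    Confident (peaked) router outputs activate few experts; ambiguous (flat)
    outputs activate more, up to `k_max`.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    limit = len(probs) if k_max is None else k_max
    chosen, total = [], 0.0
    for i in ranked[:limit]:
        chosen.append(i)
        total += probs[i]
        if total >= mass:
            break
    return {i: probs[i] / total for i in chosen}


if __name__ == "__main__":
    for name in ("uniform", "descending", "ascending", "pyramid", "wave"):
        print(name, expert_schedule(4, 16, name))
    print(dynamic_route([8.0, 0.0, 0.0, 0.0]))  # peaked router output
    print(dynamic_route([0.0, 0.0, 0.0, 0.0]))  # flat router output
```

Under these assumptions, the total expert budget stays fixed across schedules, so the schedules differ only in where capacity sits in the network. The routing rule spends fewer experts on tokens the router is confident about and more on ambiguous ones, in contrast to fixed Top-K routing, which activates the same number for every token.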