We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs an MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^{\beta}$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.