Mixture-of-experts (MoE) layers make it possible to scale transformer models while keeping inference compute fixed. Although task-expert specialization has been observed in empirical studies of frontier MoE transformer models, existing theoretical work analyzes it using continuous mixture models, which cannot model natural language effectively. An important open question is to \textit{theoretically explain task-expert specialization in transformer MoE models using discrete models of language}. To address this, we represent structured knowledge via syntactic templates and finite key-value dictionaries, and formally prove that a single-layer MoE transformer can encode this knowledge using experts that specialize in the corresponding tasks. Our construction shows how queries are routed to unique, task-specific experts whose size depends solely on the intrinsic complexity of the given task (i.e., the combined size of its syntactic templates and factual dictionary). It also provides theoretical support for empirical results on localized knowledge circuits in MoE models. We support our theoretical findings with experiments that evaluate model performance under varying MoE loss functions.