The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations that are often more amenable to human interpretation, debugging, and editing. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models. $\mu$MoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, $\mu$MoEs (1) avoid the restrictively high inference-time costs of 'soft' MoEs, yet (2) do not inherit the training issues caused by the discrete (non-differentiable) expert routing of the popular 'sparse' MoEs. We present both qualitative and quantitative evidence that scaling $\mu$MoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialization achieved when pre-training large GPT2 and MLP-Mixer models with parameter-matched $\mu$MoE blocks at every layer, while maintaining comparable accuracy. Our code is available at: https://github.com/james-oldfield/muMoE.
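To make the idea of an "implicit computation in factorized form" concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository above for that). It assumes a rank-$R$ CP factorization of an expert weight tensor of shape (experts, input, output) and a plain softmax gate; both are illustrative assumptions, and the names `G`, `A`, `B`, `C`, `mu_moe_dense`, and `mu_moe_cp` are hypothetical. The sketch compares a soft MoE that materializes the full weight tensor with one that only ever touches the factors.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(v - v.max())
    return e / e.sum()

def mu_moe_dense(x, G, W):
    """Soft MoE forward pass with an explicit (N, I, O) weight tensor."""
    a = softmax(G.T @ x)                       # (N,) expert coefficients (assumed softmax gate)
    # y_o = sum_n sum_i a_n * x_i * W[n, i, o]
    return np.einsum("n,i,nio->o", a, x, W)

def mu_moe_cp(x, G, A, B, C):
    """Same forward pass with W[n,i,o] = sum_r A[n,r] B[i,r] C[o,r] never materialized."""
    a = softmax(G.T @ x)                       # (N,)
    # Contract each mode with its own factor, then combine along the rank dimension.
    return C @ ((A.T @ a) * (B.T @ x))         # (O,)

rng = np.random.default_rng(0)
N, I, O, R = 64, 16, 8, 4                      # experts, input dim, output dim, CP rank
x = rng.normal(size=I)
G = rng.normal(size=(I, N))                    # gating matrix (hypothetical name)
A = rng.normal(size=(N, R))                    # expert-mode factor
B = rng.normal(size=(I, R))                    # input-mode factor
C = rng.normal(size=(O, R))                    # output-mode factor

# Build the full tensor only to check the factorized path against it.
W = np.einsum("nr,ir,or->nio", A, B, C)
y_dense = mu_moe_dense(x, G, W)
y_cp = mu_moe_cp(x, G, A, B, C)
print(np.allclose(y_dense, y_cp))
```

Under these assumptions, the dense path costs on the order of $N \cdot I \cdot O$ multiply-adds per input, whereas the factorized path costs on the order of $R(N + I + O)$, which is why the number of experts $N$ can be increased without the cost of the expert computation growing with the full tensor size.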