Sparse mixture-of-experts architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, and ineffective finetuning. In this work, we propose Soft MoE, a fully differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoEs, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity (and performance) at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms dense Transformers (ViTs) and popular MoEs (Tokens Choice and Experts Choice). Furthermore, Soft MoE scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, with only 2% increased inference time, and substantially better quality.
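The soft assignment described above can be sketched as follows: each expert owns a number of "slots", every slot input is a learned convex combination of all tokens (dispatch), and every output token is a convex combination of all slot outputs (combine). This is a minimal NumPy sketch under assumptions about the exact parameterization; the names `Phi`, `soft_moe_layer`, and the toy experts are illustrative, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(X, Phi, experts, slots_per_expert):
    # X: (m, d) input tokens; Phi: (d, n_slots) learned slot parameters,
    # with n_slots = len(experts) * slots_per_expert.
    logits = X @ Phi                          # (m, n_slots) token-slot affinities
    D = softmax(logits, axis=0)               # dispatch: each slot mixes over all tokens
    C = softmax(logits, axis=1)               # combine: each token mixes over all slots
    slot_inputs = D.T @ X                     # (n_slots, d) weighted token combinations
    n_experts = len(experts)
    slot_inputs = slot_inputs.reshape(n_experts, slots_per_expert, -1)
    # Each expert processes only its own slots, not all tokens.
    slot_outputs = np.stack([experts[e](slot_inputs[e]) for e in range(n_experts)])
    slot_outputs = slot_outputs.reshape(n_experts * slots_per_expert, -1)
    return C @ slot_outputs                   # (m, d) output tokens
```

Because both the dispatch and combine weights are softmaxes over dense affinities, the whole layer is differentiable end to end, with no discrete routing, so no tokens are dropped and no load-balancing auxiliary loss is required.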