Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation

Mixture of Experts (MoE) models promise to increase the parameter count of Transformer models while maintaining training and inference costs, but their application carries notable drawbacks. The key strategy of these models is to activate, for each processed token, at most a few experts, which are subsets of an extensive feed-forward layer. The operation of matching experts and tokens is discrete, which makes MoE models prone to issues such as training instability and uneven expert utilization. Existing techniques designed to address these concerns, such as auxiliary losses or balance-aware matching, either lower model performance or make the model more difficult to train. In response, we propose Mixture of Tokens, a fully differentiable model that retains the benefits of MoE architectures while avoiding these difficulties. Rather than routing tokens to experts, this approach mixes tokens from different examples before feeding them to experts, enabling the model to learn from all token-expert combinations. Importantly, this mixing can be disabled at inference time so that tokens from different sequences are not mixed. The method is also fully compatible with both masked and causal Large Language Model training and inference.
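The abstract does not spell out how tokens are mixed, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that each expert receives a softmax-weighted average of the tokens in a group (one token per example), and that the expert's output is handed back to each token scaled by the same weight. The names `mixture_of_tokens`, `controller`, `w_in`, and `w_out` are hypothetical, as is the choice of a two-layer ReLU feed-forward expert.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mixture_of_tokens(group, controller, w_in, w_out):
    """Continuous token mixing for one group of tokens (illustrative assumption).

    group:      (n_tokens, d_model), one token taken from each example
    controller: (d_model, n_experts), scores every token for every expert
    w_in:       (n_experts, d_model, d_hidden), first layer of each expert
    w_out:      (n_experts, d_hidden, d_model), second layer of each expert
    """
    # weights[t, e] is the importance of token t for expert e.
    # The softmax runs over the tokens, so each column sums to 1.
    weights = softmax(group @ controller, axis=0)

    # Each expert sees one mixed token: a weighted average of the group.
    mixed = weights.T @ group  # (n_experts, d_model)

    # Every expert is a small ReLU feed-forward network applied to its mix.
    hidden = np.maximum(np.einsum("ed,edh->eh", mixed, w_in), 0.0)
    expert_out = np.einsum("eh,ehd->ed", hidden, w_out)

    # Redistribute each expert's output to the tokens, scaled by the weights.
    return weights @ expert_out  # (n_tokens, d_model)
```

Because every step is a matrix product or a softmax, there is no discrete token-to-expert assignment, which is the property the abstract calls fully differentiable. In this sketch a group of size one reduces to applying every expert to the single token, which is one way to picture the mixing being switched off at inference.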

