ModuleFormer: Learning Modular Large Language Models From Uncurated Data

Large Language Models (LLMs) have achieved remarkable results. However, existing models are expensive to train and deploy, and it is difficult to expand their knowledge beyond the pre-training data without forgetting previous knowledge. This paper proposes a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficiency and flexibility of large language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE). Unlike the previous SMoE-based modular language model [Gururangan et al., 2021], which requires domain-labeled data to learn domain-specific experts, ModuleFormer can induce modularity from uncurated data with its new load balancing and load concentration losses. ModuleFormer is a modular architecture that includes two different types of modules: new stick-breaking attention heads and feedforward experts. Different modules are sparsely activated, conditioned on the input token, during training and inference. In our experiments, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency: since ModuleFormer activates only a subset of its modules for each input token, it can achieve the same performance as dense LLMs with more than twice the throughput; 2) Extendability: ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn new knowledge that is not included in the training data; 3) Specialization: finetuning ModuleFormer can specialize a subset of modules to the finetuning task, and the task-unrelated modules can be easily pruned for lightweight deployment.
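To make the idea of sparse, token-conditioned activation concrete, the following is a minimal sketch of generic top-k expert routing in plain Python. It is an illustration under stated assumptions, not ModuleFormer's actual implementation: the names `route_top_k` and `sparse_moe`, the toy experts, the hand-picked router logits, and the choice to renormalize weights over the selected experts are all hypothetical. The abstract's load balancing and load concentration losses and its stick-breaking attention are not modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are renormalized over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def sparse_moe(token, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Hypothetical toy experts standing in for feedforward modules.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]

token = [1.0, 2.0]
# In a real model these logits would come from a learned router applied to the token;
# here they are fixed by hand for illustration.
router_logits = [2.0, 2.0, -1.0, -3.0]

selected = route_top_k(router_logits, k=2)
output = sparse_moe(token, experts, router_logits, k=2)
print(selected, output)
```

Only two of the four experts run for this token, which is the source of the throughput gain the abstract describes: compute per token scales with k, not with the total number of modules.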
