Efficiency, specialization, and adaptability to new data distributions are qualities that are difficult to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on "upcycling" dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing in which the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for the initial upcycling, and an 18.8% relative gain when extending the MoE with a new expert using only limited finetuning data. This flexibility of Nexus is crucial for enabling an open-source ecosystem in which every user continuously assembles their own MoE mix according to their needs.
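To make the adaptive routing idea concrete, the following is a minimal sketch of a router whose expert embeddings are projected from fixed domain representations rather than learned as free parameters. All names, shapes, and the two-layer MLP projection are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AdaptiveRouter(nn.Module):
    """Sketch of a Nexus-style adaptive router (hypothetical code).

    Expert embeddings are computed by projecting per-domain
    representations, so routing generalizes to experts added later.
    """

    def __init__(self, d_model: int, d_domain: int, domain_embs: torch.Tensor):
        super().__init__()
        # domain_embs: (n_experts, d_domain), e.g. a summary representation
        # of each expert's training domain; kept frozen in this sketch.
        self.register_buffer("domain_embs", domain_embs)
        # Learned projection from domain representation to expert embedding.
        self.proj = nn.Sequential(
            nn.Linear(d_domain, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) token representations.
        expert_embs = self.proj(self.domain_embs)   # (n_experts, d_model)
        logits = hidden @ expert_embs.T             # (batch, seq, n_experts)
        return logits.softmax(dim=-1)               # routing probabilities
```

Under these assumptions, extending the MoE with a new expert amounts to computing the new domain's representation and concatenating it to `domain_embs`; the learned projection then produces a routing embedding for the unseen domain without retraining the router from scratch, which is the flexibility the abstract describes.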