The Mixture of Experts (MoE) is a widely known neural architecture in which an ensemble of specialized sub-models optimizes overall performance at a constant computational cost. However, conventional MoEs pose challenges at scale because all experts must be stored in memory. In this paper, we push MoE to the limit. We propose an extremely parameter-efficient MoE by uniquely combining the MoE architecture with lightweight experts. Our MoE architecture outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on par with full fine-tuning while updating only the lightweight experts, which amount to less than 1% of an 11B-parameter model. Furthermore, our method generalizes to unseen tasks, as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even under rigorous parameter constraints. The code used in all experiments is publicly available at https://github.com/for-ai/parameter-efficient-moe.
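As a rough illustration of the idea, and not the released implementation, the following sketch shows how a mixture of lightweight experts can sit on top of a frozen dense layer. The abstract does not say what form the lightweight experts take, so the sketch assumes each expert is a single scaling vector applied to the frozen layer's output and that a softmax router merges the experts before they are applied. The names `LightweightMoELayer`, `softmax`, `matvec`, and `trainable_fraction`, and all dimensions, are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class LightweightMoELayer:
    """Frozen dense layer plus a soft mixture of scaling-vector experts.

    Assumption for illustration only: each expert is one vector of
    length d_out that rescales the frozen layer's output, and only the
    experts and the router would be trained.
    """

    def __init__(self, d_in, d_out, num_experts, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight; random values stand in for real ones.
        self.frozen_weight = [
            [rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)
        ]
        # Trainable experts, initialised to ones so the layer starts out
        # identical to the frozen layer.
        self.experts = [[1.0] * d_out for _ in range(num_experts)]
        # Trainable router: one row of scores per expert.
        self.router = [
            [rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(num_experts)
        ]

    def forward(self, x):
        # Routing weights over experts for this input.
        gate = softmax(matvec(self.router, x))
        # Merge the experts into a single vector first, so the extra cost
        # stays close to that of one expert regardless of their number.
        d_out = len(self.experts[0])
        merged = [
            sum(g * expert[j] for g, expert in zip(gate, self.experts))
            for j in range(d_out)
        ]
        hidden = matvec(self.frozen_weight, x)
        return [m * h for m, h in zip(merged, hidden)]

    def trainable_fraction(self):
        """Share of parameters that would be updated (experts and router)."""
        d_out = len(self.frozen_weight)
        d_in = len(self.frozen_weight[0])
        frozen = d_out * d_in
        trainable = len(self.experts) * d_out + len(self.router) * d_in
        return trainable / (frozen + trainable)
```

In this toy setting, a 1024-by-1024 frozen layer with four experts has 8,192 trainable parameters against 1,048,576 frozen ones, a share of roughly 0.8%. The figure only illustrates how a sub-1% budget can arise and does not reproduce the paper's configuration.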