Although applying Mixture of Experts (MoE) to large language models for learning new tasks is widely regarded as an effective strategy for continual learning, two major challenges remain: (1) as the number of tasks grows, simple parameter-expansion strategies can lead to excessively large models; (2) modifying the parameters of the existing router erodes previously acquired knowledge. In this paper, we present LLaVA-CMoE, a continual MoE architecture that requires no replay data. Specifically, we develop Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether a specific layer requires additional knowledge. This approach enables the model to adaptively expand its network parameters according to the task distribution, significantly improving the efficiency of parameter expansion. In addition, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), in which high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that experts for new tasks do not interfere with existing ones. Our experiments show that this efficient architecture substantially improves model performance on the Coin benchmark while maintaining a reasonable parameter count.
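To make the two-level routing idea concrete, below is a minimal PyTorch sketch of one MoE layer with a task-level (inter-task) router on top of per-task (intra-task) routers, where groups of experts learned for earlier tasks are frozen when a new task is added. This is an illustrative sketch, not the paper's implementation: the class names (`HierarchicalMoELayer`, `TaskExpertGroup`), the mean-pooled hard task assignment, and the expert dimensions are all assumptions made for the example.

```python
# Sketch of two-level routing: a task-level router assigns tokens to a
# task-specific expert group; each group has its own intra-task router.
# All names and design details here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskExpertGroup(nn.Module):
    """Experts for one task plus a low-level (intra-task) router."""
    def __init__(self, hidden_dim, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(n_experts))
        self.router = nn.Linear(hidden_dim, n_experts)  # intra-task routing

    def forward(self, x):                                    # x: (tokens, d)
        gates = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-2)
        return (gates.unsqueeze(-1) * outs).sum(dim=-2)      # (tokens, d)

class HierarchicalMoELayer(nn.Module):
    """High-level router picks a task group; earlier groups stay frozen."""
    def __init__(self):
        super().__init__()
        self.groups = nn.ModuleList()
        self.task_router = None  # replaced whenever a task is added

    def add_task(self, hidden_dim, n_experts):
        # Freeze everything learned so far, append a fresh expert group, and
        # re-create the task-level router with one more output (a real
        # implementation would preserve the logits of existing tasks).
        for p in self.parameters():
            p.requires_grad_(False)
        self.groups.append(TaskExpertGroup(hidden_dim, n_experts))
        self.task_router = nn.Linear(hidden_dim, len(self.groups))

    def forward(self, x):
        # Hard task assignment from a mean-pooled token representation.
        task_probs = F.softmax(self.task_router(x.mean(dim=0, keepdim=True)), dim=-1)
        task_id = task_probs.argmax(dim=-1).item()
        return self.groups[task_id](x)

layer = HierarchicalMoELayer()
layer.add_task(hidden_dim=64, n_experts=2)   # task 1
layer.add_task(hidden_dim=64, n_experts=2)   # task 2 (task-1 experts now frozen)
print(layer(torch.randn(8, 64)).shape)       # torch.Size([8, 64])
```

Because only the newly added group and the refreshed task-level router carry gradients, knowledge stored in earlier experts and routers is left untouched when a new task arrives, which is the interference-avoidance property the abstract attributes to PTL.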