Large Language Models (LLMs) face a significant challenge in continual learning: catastrophic forgetting, where new information overwrites previously acquired knowledge. This limitation leads to substantial environmental and economic waste. In this study, we introduce PMoE, a Progressive Mixture of Experts with an Asymmetric Transformer, which aims to minimize forgetting through an asymmetric design in which shallow layers are dedicated to general knowledge and deep layers to new knowledge. PMoE progressively adds experts in the deep layers and uses a router to allocate new knowledge to the appropriate experts. Positioned adjacent to the deep layers, the router operates on deep features that aggregate consolidated information, allowing it to route new knowledge efficiently to the progressively growing set of experts. Extensive experiments on TRACE datasets and general language understanding datasets demonstrate that the proposed PMoE outperforms previous state-of-the-art approaches.
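To make the described architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a toy model width, generic Transformer encoder layers, and hypothetical names (PMoESketch, add_expert): shallow layers hold general knowledge, deep layers process new knowledge, and a router placed after the deep layers mixes progressively added experts using deep features.

```python
# Minimal sketch, not the authors' implementation. Assumes a toy model width,
# generic Transformer encoder layers, and hypothetical names (PMoESketch,
# add_expert). Shallow layers hold general knowledge; the deep layers are
# followed by a router that mixes progressively added experts via deep features.
import torch
import torch.nn as nn


class PMoESketch(nn.Module):
    def __init__(self, d_model: int = 64, n_shallow: int = 2, n_deep: int = 2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shallow = nn.ModuleList(make_layer() for _ in range(n_shallow))  # general knowledge
        self.deep = nn.ModuleList(make_layer() for _ in range(n_deep))        # new knowledge
        self.experts = nn.ModuleList()  # grows as new tasks arrive
        self.router = None              # rebuilt whenever an expert is added
        self.d_model = d_model

    def add_expert(self) -> None:
        """Progressively add one expert (a small adapter) when a new task arrives."""
        self.experts.append(nn.Sequential(
            nn.Linear(self.d_model, 16), nn.GELU(), nn.Linear(16, self.d_model)))
        self.router = nn.Linear(self.d_model, len(self.experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.shallow:
            x = blk(x)
        for blk in self.deep:
            x = blk(x)
        if not self.experts:
            return x
        # The router reads deep (consolidated) features and weights expert outputs.
        weights = torch.softmax(self.router(x), dim=-1)                  # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, D, E)
        return x + torch.einsum("btde,bte->btd", expert_out, weights)


if __name__ == "__main__":
    model = PMoESketch()
    model.add_expert()                    # first task: one expert
    model.add_expert()                    # second task: router now spans two experts
    out = model(torch.randn(2, 8, 64))
    print(out.shape)                      # torch.Size([2, 8, 64])
```

In this sketch, placing the router after the deep layers (rather than at every layer) reflects the abstract's point that deep, consolidated features give the router a more reliable basis for assigning new knowledge to experts.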