Domain-specific adaptation is critical to maximizing the performance of pre-trained language models (PLMs) on one or more target tasks, especially in resource-constrained settings such as edge devices. However, existing methods often struggle to balance domain-specific performance, retention of general knowledge, and training and inference efficiency. To address these challenges, we propose Modular Domain Experts (MoDE). MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts. These experts are trained independently and composed together via a lightweight training process. In contrast to standard low-rank adaptation methods, each MoDE expert consists of several transformer layers, which scale better with more training examples and larger parameter counts. Our evaluation demonstrates that MoDE achieves target performance comparable to full-parameter fine-tuning while achieving 1.65% better retention performance. Moreover, MoDE's architecture enables flexible sharding configurations and improves training speed by up to 38% over state-of-the-art distributed training configurations.
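The abstract describes the composition only at a high level. A minimal toy sketch of the underlying idea, augmenting a frozen base layer with independently trained expert branches combined through lightweight gates, might look as follows. All function names, the gating scheme, and the reduction of each component to a single linear map are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_layer(x, W):
    # Frozen general-PLM transformer layer, abstracted as a
    # linear map plus a residual connection.
    return x + x @ W

def expert_layer(x, W_e):
    # Domain expert: its own small stack of transformer layers,
    # trained independently on domain data (abstracted here as
    # a single linear map).
    return x @ W_e

def mode_block(x, W_base, experts, gates):
    # Compose the frozen base output with modular expert outputs.
    # In this sketch, only the scalar gates would be tuned during
    # the lightweight composition step; base and experts stay fixed.
    out = base_layer(x, W_base)
    for W_e, g in zip(experts, gates):
        out = out + g * expert_layer(x, W_e)
    return out

d = 8
x = rng.normal(size=(2, d))
W_base = rng.normal(size=(d, d)) * 0.1
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]

# With all gates at zero, the block reduces exactly to the
# unmodified base model, so general knowledge is preserved
# by construction in this toy setup.
y_off = mode_block(x, W_base, experts, gates=[0.0, 0.0])
assert np.allclose(y_off, base_layer(x, W_base))

# Nonzero gates blend in the domain experts.
y_on = mode_block(x, W_base, experts, gates=[0.5, 0.5])
```

Because each expert is a self-contained branch with its own parameters, experts can, in principle, be trained and sharded independently, which is consistent with the flexible sharding the abstract attributes to the architecture.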