Pre-trained language models (PLMs) understand general-domain text well but struggle in specialized domains. Although continued pre-training on a large domain-specific corpus is effective, tuning all the parameters for each domain is costly. In this paper, we investigate whether PLMs can be adapted both effectively and efficiently by tuning only a few parameters. Specifically, we decouple the feed-forward networks (FFNs) of the Transformer architecture into two parts: the original pre-trained FFNs, which retain the old-domain knowledge, and our novel domain-specific adapters, which inject domain-specific knowledge in parallel. We then adopt a mixture-of-adapters gate to dynamically fuse the knowledge from the different domain adapters. Our proposed Mixture-of-Domain-Adapters (MixDA) employs a two-stage adapter-tuning strategy that leverages both unlabeled and labeled data for domain adaptation: i) training the domain-specific adapter on unlabeled data, followed by ii) training the task-specific adapter on labeled data. MixDA can be seamlessly plugged into the pretraining-finetuning paradigm, and our experiments demonstrate that it achieves superior performance on in-domain tasks (GLUE), out-of-domain tasks (ChemProt, RCT, IMDB, Amazon), and knowledge-intensive tasks (KILT). Further analyses demonstrate the reliability, scalability, and efficiency of our method. The code is available at https://github.com/Amano-Aki/Mixture-of-Domain-Adapters.
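The decoupled FFN described in the abstract can be illustrated with a minimal NumPy sketch of a single forward pass. This is not the authors' implementation: the bottleneck adapter shape, the softmax-over-linear-scores gate, the layer sizes, and every name here (`mixda_ffn`, `W_gate`, `adapters`, and so on) are assumptions made for illustration, and random matrices stand in for pre-trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not taken from the paper).
d_model, d_ff, d_adapter, n_domains = 8, 32, 4, 3


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Stand-ins for the frozen pre-trained FFN weights.
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

# One small bottleneck adapter per domain: (down-projection, up-projection).
# The bottleneck form is an assumption.
adapters = [
    (rng.normal(size=(d_model, d_adapter)) * 0.1,
     rng.normal(size=(d_adapter, d_model)) * 0.1)
    for _ in range(n_domains)
]

# Hypothetical gate: linear scores per domain followed by a softmax.
W_gate = rng.normal(size=(d_model, n_domains)) * 0.1


def mixda_ffn(x):
    """x: (tokens, d_model). Returns the fused output and the gate weights."""
    # Original FFN path, kept unchanged to retain old-domain knowledge.
    ffn_out = relu(x @ W1) @ W2
    # Domain adapters run in parallel on the same input.
    adapter_outs = np.stack(
        [relu(x @ down) @ up for down, up in adapters], axis=1
    )  # (tokens, n_domains, d_model)
    # Per-token mixing weights over the domain adapters.
    gate = softmax(x @ W_gate)  # (tokens, n_domains)
    mixed = (gate[..., None] * adapter_outs).sum(axis=1)
    return ffn_out + mixed, gate


x = rng.normal(size=(5, d_model))
out, gate = mixda_ffn(x)
```

Under this reading, `W1` and `W2` would stay frozen during adaptation, and only the much smaller adapter and gate parameters would be updated, which is where the efficiency claim comes from.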