Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at https://github.com/cambridgeltl/composable-sft.
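To make the composition step concrete, here is a minimal sketch, not the released implementation. It assumes each sparse mask can be represented as an additive difference vector over the pretrained parameters and that composition is element-wise addition; the abstract says only that the masks are "composed with the pretrained model". The names `top_k_sparse_diff` and `compose`, the flat-list parameter layout, and the toy numbers are all hypothetical. Selecting the largest-magnitude changes stands in for the Lottery-Ticket-style selection, and the sketch does no further training of the selected entries.

```python
from typing import List


def top_k_sparse_diff(pretrained: List[float], finetuned: List[float], k: int) -> List[float]:
    """Return the difference (finetuned - pretrained), keeping only the k
    entries with the largest absolute change and zeroing the rest.
    Assumption: magnitude-based selection, used here for illustration only."""
    diffs = [f - p for p, f in zip(pretrained, finetuned)]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(diffs)]


def compose(pretrained: List[float], *sparse_diffs: List[float]) -> List[float]:
    """Add any number of sparse difference vectors to the pretrained
    parameters. The result has the same length as the input, so no
    parameters are added and the architecture is unchanged."""
    out = list(pretrained)
    for diff in sparse_diffs:
        if len(diff) != len(out):
            raise ValueError("difference vector must match parameter count")
        out = [o + d for o, d in zip(out, diff)]
    return out


# Toy values (hypothetical): four parameters, one changed entry kept per mask.
pretrained = [0.5, -1.0, 2.0, 0.0]
task_finetuned = [0.5, -0.25, 2.0, 0.125]   # e.g. trained on source-language task data
lang_finetuned = [1.0, -1.0, 2.0, 0.0]      # e.g. trained with MLM on the target language

task_diff = top_k_sparse_diff(pretrained, task_finetuned, k=1)
lang_diff = top_k_sparse_diff(pretrained, lang_finetuned, k=1)
composed = compose(pretrained, task_diff, lang_diff)
print(composed)  # [1.0, -0.25, 2.0, 0.0]
```

In this toy case the two difference vectors touch disjoint entries, so adding them cannot overwrite either change. That is the intuition behind the finding that sparsity limits interference between the fine-tunings being composed.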