Pretrained language models (PLMs) are today the dominant models for natural language processing. Despite their impressive downstream performance, PLMs can be difficult to apply to new languages, which is a barrier to making their capabilities universally accessible. While prior work has shown that this issue can be addressed by learning a new embedding layer for the new language, doing so is both data- and compute-inefficient. We propose an active forgetting mechanism during pretraining as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability to learn new embeddings within a limited number of updates, an effect similar to meta-learning. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only converge faster during language adaptation but also outperform standard models in a low-data regime, particularly for languages that are distant from English.
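To make the reset schedule concrete, the following is a minimal, dependency-free sketch of resetting the embedding layer every K updates. It is an illustration under stated assumptions rather than the training code used in the experiments: the function names (`init_embeddings`, `toy_update`, `pretrain_with_forgetting`), the Gaussian initialization with standard deviation 0.02, and the constant-step toy update are all hypothetical stand-ins. Only the embedding table is re-initialized; the parameters of the model body carry over across resets.

```python
import random


def init_embeddings(vocab_size, dim, rng, std=0.02):
    """Draw a fresh embedding table from N(0, std^2) (assumed initializer)."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)]


def toy_update(embeddings, body, lr=0.1):
    """Placeholder for one optimizer step: shifts every parameter by lr.

    A real setup would compute a masked-language-modeling loss and apply
    gradients; this stand-in only makes the effect of a reset visible.
    """
    for row in embeddings:
        for j in range(len(row)):
            row[j] += lr
    for j in range(len(body)):
        body[j] += lr


def pretrain_with_forgetting(num_updates, reset_every, vocab_size=8, dim=4, seed=0):
    """Run num_updates toy steps, resetting the embeddings every reset_every steps."""
    rng = random.Random(seed)
    embeddings = init_embeddings(vocab_size, dim, rng)
    body = [0.0] * dim  # stand-in for the transformer body parameters
    reset_steps = []
    for step in range(1, num_updates + 1):
        toy_update(embeddings, body)
        if step % reset_every == 0:
            # Active forgetting: discard the learned embeddings, keep the body.
            embeddings = init_embeddings(vocab_size, dim, rng)
            reset_steps.append(step)
    return embeddings, body, reset_steps


if __name__ == "__main__":
    emb, body, resets = pretrain_with_forgetting(num_updates=10, reset_every=4)
    print(resets)  # prints [4, 8]
```

In this sketch `reset_every` plays the role of K. The body parameters accumulate all ten updates, while the embeddings reflect only the updates made since the most recent reset. Whether optimizer state for the embeddings should also be reset is not modeled here.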