Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models requires considerable resources that are available almost exclusively for high-resource languages. In contrast, static word embeddings are easier to train in terms of both computing resources and the amount of data required. In this paper, we introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer), a novel and challenging task that is especially relevant to low-resource languages for which static word embeddings are available. To tackle the task, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source-language PLM and the static word embeddings of a target language. In this way, we can train the PLM on source-language training data and perform zero-shot transfer to the target language by simply swapping the embedding layer. However, through extensive experiments on two classification datasets, we show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines. We attempt to explain this negative result and offer several thoughts on possible improvements.
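To make the idea of a common space concrete, the following is a minimal sketch, assuming that a relative representation is the vector of cosine similarities between an embedding and a fixed set of anchor embeddings. The abstract does not specify how anchors are chosen or which similarity is used, so the anchor indices, array sizes, and the function name `relative_representation` are hypothetical and serve only as illustration. The embeddings are random toy data, not real PLM or static word embeddings.

```python
import numpy as np


def relative_representation(embeddings, anchors):
    """Return cosine similarities of every embedding row to every anchor row.

    Assumption: this cosine-to-anchors formulation stands in for the
    relative representations mentioned in the abstract.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return e @ a.T


rng = np.random.default_rng(0)

# Toy stand-ins: 10 source tokens (dim 8) and 12 target words (dim 5).
src_emb = rng.normal(size=(10, 8))  # hypothetical source-language PLM embeddings
tgt_emb = rng.normal(size=(12, 5))  # hypothetical target-language static embeddings

# Hypothetical anchor pairs: source token i is assumed to correspond to
# target word j (for example, via a bilingual dictionary).
src_anchor_ids = [0, 1, 2, 3]
tgt_anchor_ids = [4, 5, 6, 7]

src_rel = relative_representation(src_emb, src_emb[src_anchor_ids])
tgt_rel = relative_representation(tgt_emb, tgt_emb[tgt_anchor_ids])

# Both vocabularies now live in a space whose dimension is the number of anchors.
print(src_rel.shape, tgt_rel.shape)
```

Although the two original embedding spaces have different dimensions, both relative representations have one coordinate per anchor. That shared dimensionality is what would allow a model trained on top of the source-side representation to accept the target-side one when the embedding layer is swapped.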