MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer

Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models requires considerable resources that are available almost exclusively for high-resource languages. In contrast, static word embeddings are easier to train in terms of both computing resources and the amount of data required. In this paper, we introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer), a novel and challenging task that is especially relevant to low-resource languages for which static word embeddings are available. To tackle the task, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source-language PLM and the static word embeddings of a target language. In this way, we can train the PLM on source-language training data and perform zero-shot transfer to the target language by simply swapping the embedding layer. However, through extensive experiments on two classification datasets, we show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines. We attempt to explain this negative result and offer several thoughts on possible improvements.
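The common-space idea can be illustrated with a small sketch. Relative representations describe each embedding by its cosine similarities to a fixed set of anchor embeddings; if the anchors of the two spaces correspond to each other (here assumed to be index-aligned pairs, standing in for translation pairs), both vocabularies land in a space whose dimensionality equals the number of anchors, regardless of their original dimensions. The Python code below uses random toy data, and the names `relative_representation`, `src_emb`, `tgt_emb`, and `anchor_ids` are hypothetical; it is a sketch of the general technique, not the authors' implementation.

```python
import numpy as np


def relative_representation(embeddings, anchors):
    """Cosine similarity of every row of `embeddings` to every row of `anchors`.

    Returns an array of shape (num_words, num_anchors).
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    anc = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return emb @ anc.T


rng = np.random.default_rng(0)

# Toy stand-in for a source PLM embedding matrix: 50 words, 16 dimensions.
src_emb = rng.normal(size=(50, 16))

# Toy stand-in for target static embeddings: the same 50 "words" mapped into
# a 24-dimensional space by a scaled isometry (assumption for illustration;
# real embedding spaces are not related this cleanly).
q, _ = np.linalg.qr(rng.normal(size=(24, 16)))  # q has orthonormal columns
tgt_emb = 3.0 * src_emb @ q.T                   # shape (50, 24)

# Assumed anchor pairs: word i in the source corresponds to word i in the target.
anchor_ids = np.arange(10)

src_rel = relative_representation(src_emb, src_emb[anchor_ids])
tgt_rel = relative_representation(tgt_emb, tgt_emb[anchor_ids])

# Both vocabularies now live in the same 10-dimensional anchor space.
print(src_rel.shape, tgt_rel.shape)
print(np.allclose(src_rel, tgt_rel))
```

Because cosine similarity is unchanged by rotation and uniform rescaling, the two relative representations coincide in this toy setup even though the original spaces have 16 and 24 dimensions. With real PLM and static embeddings the match would only be approximate.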


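Related content

Word vector representations

A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Word embedding methods can be trained directly on large unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: larger windows tend to yield embeddings that reflect topical information, while smaller windows tend to yield embeddings that reflect a word's function and its contextual semantics.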
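The analogy relation can be checked with a few lines of Python. The vectors below are hand-written toy values chosen so that the relation holds exactly (an assumption for illustration; trained embeddings satisfy it only approximately), and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Hypothetical hand-written 3-d vectors with axes [royalty, gender, vehicle].
vectors = {
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "car":   np.array([0.0,  0.0, 1.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by nearest cosine neighbour of b - a + c."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# man - woman and king - queen are the same offset in this toy space.
print(np.allclose(vectors["man"] - vectors["woman"],
                  vectors["king"] - vectors["queen"]))
print(analogy("man", "woman", "king"))
```

Excluding the three query words from the candidates is a common convention, since the query vector often lies closest to one of its own inputs.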
