Cross-lingual transfer learning is widely used in Event Extraction for low-resource languages: a Multilingual Language Model is trained on a source language and then applied to the target language. This paper studies whether the typological similarity between source and target languages affects the performance of cross-lingual transfer, an under-explored topic. We first focus on Basque, an ideal target language because it is typologically different from its surrounding languages. Our experiments on three Event Extraction tasks show that linguistic characteristics shared between the source and target languages do have an impact on transfer quality. Further analysis of 72 language pairs reveals that for tasks that involve token classification, such as entity and event trigger identification, a common writing script and shared morphological features produce higher-quality cross-lingual transfer. In contrast, for tasks involving structural prediction, such as argument extraction, common word order is the most relevant feature. In addition, we show that as the training set size increases, not all languages scale in the same way in the cross-lingual setting. To perform the experiments, we introduce EusIE, an event extraction dataset for Basque, which follows the Multilingual Event Extraction dataset (MEE). The dataset and code are publicly available.