Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources comparable in size to large English datasets such as TACRED (Zhang et al., 2017). To address this gap, we introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families, which is created by machine-translating TACRED instances and automatically projecting their entity annotations. We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models in common transfer learning scenarios. Our analyses show that machine translation is a viable strategy to transfer RE instances, with native speakers judging more than 83% of the translated instances to be linguistically and semantically acceptable. We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts. However, we also observe a variety of translation and annotation projection errors, due both to the MT systems and to linguistic features of the target languages, such as pronoun-dropping, compounding and inflection, that degrade dataset quality and RE model performance.
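To make the idea of annotation projection concrete, the following is a minimal sketch of one common approach: wrapping the head and tail entity mentions in marker tags before translation and reading the spans back from the translated text. The abstract does not specify the projection mechanism, so the tag-based scheme, the `<H>`/`<T>` tag names, and the helper names `mark_entities` and `project_entities` are assumptions made for illustration. No MT system is called here; the translated strings are hand-written stand-ins for MT output.

```python
import re

# Hypothetical marker tags for the head (H) and tail (T) entity mentions.
TAG = re.compile(r"(</?[HT]>)")


def mark_entities(tokens, head, tail):
    """Wrap the head and tail spans (inclusive token indices) in marker tags."""
    out = []
    for i, tok in enumerate(tokens):
        if i == head[0]:
            out.append("<H>")
        if i == tail[0]:
            out.append("<T>")
        out.append(tok)
        if i == head[1]:
            out.append("</H>")
        if i == tail[1]:
            out.append("</T>")
    return " ".join(out)


def project_entities(translated):
    """Recover tokens and entity spans from tagged, translated text.

    Returns (tokens, head_span, tail_span) with inclusive token indices,
    or None if a tag was lost or malformed, so the instance can be discarded.
    """
    tokens, open_at, spans = [], {}, {}
    for piece in TAG.split(translated):
        if piece in ("<H>", "<T>"):
            open_at[piece[1]] = len(tokens)
        elif piece in ("</H>", "</T>"):
            name = piece[2]
            # Reject a closing tag with no opening tag, or an empty span.
            if name not in open_at or open_at[name] == len(tokens):
                return None
            spans[name] = (open_at.pop(name), len(tokens) - 1)
        else:
            # Plain text: whitespace tokenization is a simplification.
            tokens.extend(piece.split())
    if set(spans) != {"H", "T"} or open_at:
        return None
    return tokens, spans["H"], spans["T"]


source = mark_entities(["Bill", "Gates", "founded", "Microsoft", "."], (0, 1), (3, 3))
# Stand-in for the MT output of `source`; word order changes in translation.
translated = "<T>Microsoft</T> wurde von <H>Bill Gates</H> gegründet."
result = project_entities(translated)
```

Returning `None` when a tag goes missing reflects the kind of projection error the abstract mentions: if the MT system drops or mangles a marker, the instance cannot be recovered automatically and would need to be filtered or repaired.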