Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources comparable in size to large English datasets such as TACRED (Zhang et al., 2017). To address this gap, we introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families, created by machine-translating TACRED instances and automatically projecting their entity annotations. We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models in common transfer learning scenarios. Our analyses show that machine translation is a viable strategy to transfer RE instances, with native speakers judging more than 84% of the translated instances to be linguistically and semantically acceptable. We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts. However, we also observe a variety of translation and annotation projection errors, due both to the MT systems and to linguistic features of the target languages, such as pronoun-dropping, compounding and inflection, that degrade dataset quality and RE model performance.
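To make the idea of "machine-translating instances and automatically projecting their entity annotations" concrete, the following is a minimal sketch of one common way to do this: wrap the head and tail entity spans in marker tags before translation, then read the spans back out of the translated text. The tag names (`<H>`, `<T>`), the function names, and the whitespace tokenization are assumptions made for illustration and are not taken from the abstract. The MT call itself is omitted, so a hand-written translation stands in for MT output.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical marker tags; the markup actually used to build the dataset may differ.
HEAD_OPEN, HEAD_CLOSE = "<H>", "</H>"
TAIL_OPEN, TAIL_CLOSE = "<T>", "</T>"
_TAG = re.compile(r"(</?[HT]>)")

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def mark_entities(tokens: List[str], head: Span, tail: Span) -> str:
    """Wrap the head and tail spans in marker tags (assumes non-overlapping spans)."""
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i == head[0]:
            out.append(HEAD_OPEN)
        if i == tail[0]:
            out.append(TAIL_OPEN)
        out.append(tok)
        if i == head[1] - 1:
            out.append(HEAD_CLOSE)
        if i == tail[1] - 1:
            out.append(TAIL_CLOSE)
    return " ".join(out)


def project_annotations(translated: str) -> Optional[Dict[str, object]]:
    """Recover tokens and entity spans from marked-up translated text.

    Returns None when the markers are missing, duplicated, or malformed,
    i.e. when the projection cannot be trusted.
    """
    for tag in (HEAD_OPEN, HEAD_CLOSE, TAIL_OPEN, TAIL_CLOSE):
        if translated.count(tag) != 1:
            return None

    # Pad tags with spaces so they split off as separate pieces.
    pieces = _TAG.sub(r" \1 ", translated).split()
    tokens: List[str] = []
    starts: Dict[str, int] = {}
    spans: Dict[str, Span] = {}
    for piece in pieces:
        if piece in (HEAD_OPEN, TAIL_OPEN):
            starts[piece[1]] = len(tokens)          # "H" or "T"
        elif piece in (HEAD_CLOSE, TAIL_CLOSE):
            key = piece[2]                          # "H" or "T"
            if key not in starts:
                return None                         # closing tag before opening tag
            spans[key] = (starts[key], len(tokens))
        else:
            tokens.append(piece)

    head, tail = spans.get("H"), spans.get("T")
    if head is None or tail is None or head[0] == head[1] or tail[0] == tail[1]:
        return None                                 # empty or missing span
    return {"tokens": tokens, "head": head, "tail": tail}


# Usage: mark an English instance, then project from a stand-in "translation".
marked = mark_entities(["Bill", "Gates", "founded", "Microsoft", "."], (0, 2), (3, 4))
projected = project_annotations("<T>Microsoft</T> wurde von <H>Bill Gates</H> gegründet .")
```

In this sketch, instances whose markers do not survive translation intact are rejected rather than repaired. Whitespace tokenization is also a simplification: it would not handle the compounding and inflection effects the abstract mentions, where an entity may end up fused into a larger word.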