Relation Extraction (RE) is the task of identifying relationships between entities in text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly for languages other than English. In this paper, we address this issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SRED$^{\rm FM}$, an automatically annotated dataset covering 18 languages, 400 relation types, and 13 entity types, totaling more than 40 million triplet instances. Second, we propose RED$^{\rm FM}$, a smaller, human-revised dataset covering seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, which extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at https://www.github.com/babelscape/rebel