Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited size of the training sets in current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has been shown to be beneficial across many NLP tasks. However, this setup still requires supplementary annotated data, which is often not available. In this paper, we investigate intermediate pre-training specifically for RE. We exploit the affinity between syntactic structure and semantic RE, and identify the syntactic relations that are most closely related to RE, namely those lying on the shortest dependency path between two entities. We then take advantage of the high accuracy of current syntactic parsers to automatically obtain large amounts of low-cost pre-training data. By pre-training our RE model on the relevant syntactic relations, we are able to outperform the baseline in five out of six cross-domain setups, without any additional annotated data.
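To make the notion of "syntactic relations on the shortest dependency path between two entities" concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes a dependency parse is already available; the `PARSE` list is a hand-written, hypothetical parse with UD-style labels standing in for parser output, and the function names `sdp_edges` and `sdp_triples` are illustrative only. The path is found with a breadth-first search over the dependency tree treated as an undirected graph.

```python
from collections import deque

# Hypothetical hand-written parse (UD-style labels) standing in for parser output.
# Each token is (index, form, head_index, deprel); head 0 marks the root.
PARSE = [
    (1, "Turing", 3, "nsubj:pass"),
    (2, "was", 3, "aux:pass"),
    (3, "born", 0, "root"),
    (4, "in", 5, "case"),
    (5, "London", 3, "obl"),
]


def sdp_edges(parse, source, target):
    """Return the (head, dependent, label) edges on the shortest path
    between two token indices, or None if they are not connected."""
    # Undirected adjacency list; every entry remembers the original directed edge.
    graph = {idx: [] for idx, _, _, _ in parse}
    for idx, _, head, rel in parse:
        if head != 0:  # skip the artificial root attachment
            edge = (head, idx, rel)
            graph[idx].append((head, edge))
            graph[head].append((idx, edge))

    # Breadth-first search; in a tree the first path found is the only one.
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for neighbour, edge in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [edge]))
    return None


def sdp_triples(parse, source, target):
    """Map the path edges to (head form, dependent form, label) triples."""
    forms = {idx: form for idx, form, _, _ in parse}
    edges = sdp_edges(parse, source, target) or []
    return [(forms[h], forms[d], rel) for h, d, rel in edges]


# Entities "Turing" (token 1) and "London" (token 5):
print(sdp_triples(PARSE, 1, 5))
# [('born', 'Turing', 'nsubj:pass'), ('born', 'London', 'obl')]
```

Under these assumptions, each triple on the path could serve as one automatically labelled instance, with the dependency label playing the role that a semantic relation label plays in RE. How such instances are actually sampled and encoded for pre-training is not specified in the abstract.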