Extracting structured and grounded fact triples from raw text is a fundamental task in Information Extraction (IE). Existing IE datasets are typically collected from Wikipedia articles, using hyperlinks to link entities to the Wikidata knowledge base. However, models trained only on Wikipedia have limitations when applied to web domains, which often contain noisy text or text that does not have any factual information. We present WebIE, the first large-scale, entity-linked closed IE dataset consisting of 1.6M sentences automatically collected from the English Common Crawl corpus. WebIE also includes negative examples, i.e. sentences without fact triples, to better reflect the data on the web. We annotate ~21K triples from WebIE through crowdsourcing and introduce mWebIE, a translation of the annotated set in four other languages: French, Spanish, Portuguese, and Hindi. We evaluate the in-domain, out-of-domain, and zero-shot cross-lingual performance of generative IE models and find models trained on WebIE show better generalisability. We also propose three training strategies that use entity linking as an auxiliary task. Our experiments show that adding Entity-Linking objectives improves the faithfulness of our generative IE models.
翻译:从原始文本中提取结构化且基于事实的三元组是信息抽取(IE)领域的基础任务。现有IE数据集通常来源于维基百科文章,利用超链接将实体链接到维基数据知识库。然而,仅在维基百科上训练的模型在应用于网络域时存在局限性,因为网络文本常包含噪声或不含任何事实信息。我们提出了WebIE——首个大规模、实体链接的封闭式IE数据集,包含从英文Common Crawl语料库中自动收集的160万条句子。WebIE还包含负例(即不含事实三元组的句子),以更真实地反映网络数据分布。我们通过众包方式为WebIE标注了约2.1万条三元组,并发布了mWebIE——该标注集在法语、西班牙语、葡萄牙语和印地语等四种语言上的翻译版本。我们评估了生成式IE模型在域内、跨域及零样本跨语言场景下的性能,发现基于WebIE训练的模型展现出更强的泛化能力。此外,我们提出了三种利用实体链接作为辅助任务的训练策略。实验表明,加入实体链接目标可以提升生成式IE模型的忠实度。