The task of Information Extraction (IE) involves automatically converting unstructured textual content into structured data. Most research in this field concentrates on extracting either all facts or a specific set of relationships from documents. In this paper, we present a method for the extraction and categorisation of an unrestricted set of relationships from text. Our method relies on morpho-syntactic extraction patterns obtained by distant supervision, and creates Syntactic and Semantic Indices to extract and classify candidate graphs. We evaluate our approach on six datasets built from Wikidata and Wikipedia. The evaluation shows that our approach can achieve Precision scores of up to 0.85, but with lower Recall and F1 scores. Our approach makes it possible to quickly create rule-based systems for Information Extraction and to build annotated datasets for training machine-learning and deep-learning-based classifiers.
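To make the idea of distant supervision concrete, the following is a minimal sketch of how known knowledge-base triples can be aligned with sentences to produce weakly labelled examples. It is not the method of this paper, which derives morpho-syntactic patterns and builds Syntactic and Semantic Indices; it illustrates only the alignment step under simplifying assumptions. The names `Triple` and `distant_label`, the plain substring matching, and the toy triples and sentences are all hypothetical and are not drawn from the Wikidata and Wikipedia datasets used in the evaluation.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A knowledge-base fact: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


def distant_label(sentences, triples):
    """Pair each sentence with every triple whose subject and object both occur in it.

    Assumption: case-insensitive substring matching stands in for real
    entity linking, which a practical system would need.
    """
    labelled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for triple in triples:
            if triple.subject.lower() in lowered and triple.obj.lower() in lowered:
                labelled.append((sentence, triple))
    return labelled


# Toy, hand-written data (hypothetical; not taken from the paper's datasets).
TRIPLES = [
    Triple("Marie Curie", "place_of_birth", "Warsaw"),
    Triple("Marie Curie", "field_of_work", "physics"),
]
SENTENCES = [
    "Marie Curie was born in Warsaw in 1867.",
    "Warsaw is the capital of Poland.",
    "Marie Curie conducted pioneering research in physics.",
]

if __name__ == "__main__":
    for sentence, triple in distant_label(SENTENCES, TRIPLES):
        print(f"{triple.relation}: {sentence}")
```

Sentences labelled in this way are noisy, since co-occurrence of two entities does not guarantee that the sentence expresses the relation. This is one reason why a pattern-based approach may trade Recall for Precision.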