KGCleaner is a framework for identifying and correcting errors in data produced and delivered by an information extraction system. These tasks have been understudied, and KGCleaner is the first to address both. We introduce a multi-task model that jointly learns to predict whether an extracted relation is credible and to repair it if it is not. We evaluate our approach and other models as instances of our framework on two collections: a Wikidata corpus of nearly 700K facts and 5M fact-relevant sentences, and a collection of 30K facts from the 2015 TAC Knowledge Base Population task. For credibility classification, a simple, parameter-efficient shallow neural network can achieve an absolute performance gain of 30 $F_1$ points on Wikidata and comparable performance on TAC. For the repair task, a significant performance gain (more than twofold) can be obtained, depending on the nature of the dataset and the models.