Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation

Datasets that pair Knowledge Graphs (KG) with text (KG-T) can be used to train forward and reverse neural models that generate text from a KG and vice versa. However, models trained on datasets whose KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise, and find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate the source KG or text is a proxy for the equivalence between the KG and the text in that dataset. Using cyclic evaluation, we find that the manually created WebNLG is much better than the automatically created TeKGen and T-REx. Guided by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text, and show the impact of each heuristic on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs), and observe that models trained on them perform well on cyclic generation of text but less well on cyclic generation of KGs, probably because the synthetic data lacks a consistent underlying ontology.
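
The cyclic evaluation idea described above can be illustrated with a minimal Python sketch for the KG direction: a KG is passed through a forward (KG-to-text) model, the resulting text is passed through a reverse (text-to-KG) model, and the regenerated KG is compared with the source. Everything here is an assumption made for illustration: the abstract does not say which metric the paper uses, so exact-match triple F1 is a stand-in, and `toy_kg_to_text`, `toy_text_to_kg`, and `lossy_kg_to_text` are hypothetical placeholders for trained models.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
KG = FrozenSet[Triple]


def triple_f1(source: KG, regenerated: KG) -> float:
    """Exact-match F1 between source and regenerated triples (assumed metric)."""
    if not source and not regenerated:
        return 1.0
    overlap = len(source & regenerated)
    if overlap == 0:
        return 0.0
    precision = overlap / len(regenerated)  # low precision suggests hallucinated triples
    recall = overlap / len(source)          # low recall suggests dropped triples
    return 2 * precision * recall / (precision + recall)


def cyclic_kg_score(
    kgs: Iterable[KG],
    kg_to_text: Callable[[KG], str],
    text_to_kg: Callable[[str], KG],
) -> float:
    """Average F1 over the cycle KG -> text -> KG."""
    scores = [triple_f1(kg, text_to_kg(kg_to_text(kg))) for kg in kgs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical stand-ins for trained forward and reverse models.
def toy_kg_to_text(kg: KG) -> str:
    """Verbalize each triple on its own line as 'subject | predicate | object'."""
    return "\n".join(" | ".join(triple) for triple in sorted(kg))


def toy_text_to_kg(text: str) -> KG:
    """Parse the line-per-triple format produced by toy_kg_to_text."""
    return frozenset(tuple(line.split(" | ")) for line in text.splitlines() if line)


def lossy_kg_to_text(kg: KG) -> str:
    """A forward model that silently drops one triple, mimicking poor recall."""
    return "\n".join(" | ".join(triple) for triple in sorted(kg)[:-1])


example_kgs = [
    frozenset({
        ("LAGRANGE", "instance of", "dataset"),
        ("LAGRANGE", "pairs", "KG and text"),
    })
]
faithful_score = cyclic_kg_score(example_kgs, toy_kg_to_text, toy_text_to_kg)
lossy_score = cyclic_kg_score(example_kgs, lossy_kg_to_text, toy_text_to_kg)
```

In this toy setup the faithful cycle regenerates the source KG exactly, while the lossy forward model lowers the score because one triple cannot be recovered. The text direction (text to KG to text) would follow the same pattern with a text-similarity metric in place of triple F1.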

