Datasets that pair knowledge graphs (KGs) with text (KG-T) can be used to train forward and reverse neural models that generate text from KGs and vice versa. However, models trained on datasets in which the KG and text of a pair are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise, and we find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate the source KG or text is a proxy for the equivalence between KG and text in that dataset. Using cyclic evaluation, we find that the manually created WebNLG is much better than the automatically created TeKGen and T-REx. Guided by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text, and we show the impact of each heuristic on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs) and observe that models trained on them perform very well on cyclic generation of text but less well on cyclic generation of KGs, probably because the synthetic datasets lack a consistent underlying ontology.
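To make the idea of cyclic evaluation concrete, the following is a minimal sketch of the KG-to-text-to-KG direction: a forward model turns each KG into text, a reverse model turns that text back into a KG, and the regenerated KG is compared with the source. The abstract does not specify the metric or the models, so everything here is an assumption made for illustration: the exact-match triple F1, the function names (`triple_f1`, `cyclic_kg_score`), and the toy string-based stand-ins for the trained forward and reverse models.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def triple_f1(predicted: Iterable[Triple], reference: Iterable[Triple]) -> float:
    """Exact-match F1 between two sets of triples (an assumed metric)."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # low precision suggests hallucinated triples
    recall = overlap / len(ref)      # low recall suggests dropped triples
    return 2 * precision * recall / (precision + recall)


def cyclic_kg_score(
    kgs: List[List[Triple]],
    kg_to_text: Callable[[List[Triple]], str],
    text_to_kg: Callable[[str], List[Triple]],
) -> float:
    """Average F1 between each source KG and its KG -> text -> KG regeneration."""
    scores = [triple_f1(text_to_kg(kg_to_text(kg)), kg) for kg in kgs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical stand-ins for trained models: a lossless serializer and parser.
def toy_kg_to_text(kg: List[Triple]) -> str:
    return " ; ".join(" | ".join(triple) for triple in kg)


def toy_text_to_kg(text: str) -> List[Triple]:
    return [tuple(part.split(" | ")) for part in text.split(" ; ") if part]


# A forward model that only verbalizes the first triple, mimicking a model
# trained on pairs whose text omits part of the KG.
def lossy_kg_to_text(kg: List[Triple]) -> str:
    return toy_kg_to_text(kg[:1])


example_kgs = [[("Paris", "capital_of", "France"), ("France", "continent", "Europe")]]
faithful_score = cyclic_kg_score(example_kgs, toy_kg_to_text, toy_text_to_kg)
lossy_score = cyclic_kg_score(example_kgs, lossy_kg_to_text, toy_text_to_kg)
print(faithful_score, lossy_score)
```

In this toy setting the lossless pair regenerates the source KG exactly, while the lossy forward model lowers recall and therefore the cyclic score. The text-to-KG-to-text direction would follow the same pattern with a text similarity metric in place of triple F1.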