While multi-modal models have successfully integrated information from image, video, and audio modalities, integrating graph modality into large language models (LLMs) remains unexplored. This discrepancy largely stems from the inherent divergence between structured graph data and unstructured text data. Incorporating graph knowledge provides a reliable source of information, enabling potential solutions to address issues in text generation, e.g., hallucination, and lack of domain knowledge. To evaluate the integration of graph knowledge into language models, a dedicated dataset is needed. However, there is currently no benchmark dataset specifically designed for multimodal graph-language models. To address this gap, we propose GraphextQA, a question answering dataset with paired subgraphs, retrieved from Wikidata, to facilitate the evaluation and future development of graph-language models. Additionally, we introduce a baseline model called CrossGNN, which conditions answer generation on the paired graphs by cross-attending question-aware graph features at decoding. The proposed dataset is designed to evaluate graph-language models' ability to understand graphs and make use of it for answer generation. We perform experiments with language-only models and the proposed graph-language model to validate the usefulness of the paired graphs and to demonstrate the difficulty of the task.
翻译:尽管多模态模型已成功整合来自图像、视频和音频模态的信息,但将图模态整合到大语言模型(LLMs)中仍鲜有探索。这一差距主要源于结构化图数据与非结构化文本数据之间的固有差异。融入图知识可提供可靠的信息源,为解决文本生成中的问题(如幻觉和领域知识缺失)提供潜在方案。为评估图知识在语言模型中的整合效果,需要专门的数据集。然而,目前尚未有针对多模态图-语言模型的基准数据集。为弥补这一空白,我们提出GraphextQA——一个包含从Wikidata检索的配对子图的问答数据集,以促进图-语言模型的评估和未来发展。此外,我们引入名为CrossGNN的基线模型,该模型通过在解码阶段交叉注意力处理问题感知的图特征,使答案生成依赖于配对图。所提出的数据集旨在评估图-语言模型理解图结构并利用其生成答案的能力。我们通过仅使用语言模型和所提出的图-语言模型进行实验,验证配对图的有效性并展示任务的难度。