Large Language Models (LLMs) are increasingly applied to tasks involving graph structures. Although LLMs can process graph information in textual form, they overlook the rich vision modality, which is an intuitive way for humans to comprehend structural information and perform general graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., $\textit{visual graphs}$) remain unexplored. To fill this gap, we propose an end-to-end framework, called $\textbf{G}$raph to v$\textbf{I}$sual and $\textbf{T}$extual Integr$\textbf{A}$tion (GITA), which is the first to incorporate visual graphs into general graph reasoning. In addition, we establish the $\textbf{G}$raph-based $\textbf{V}$ision-$\textbf{L}$anguage $\textbf{Q}$uestion $\textbf{A}$nswering (GVLQA) dataset from existing graph data, the first vision-language dataset for general graph reasoning. Extensive experiments on the GVLQA dataset and five real-world datasets show that GITA outperforms mainstream LLMs in general graph reasoning capability. Moreover, we demonstrate the effectiveness of layout augmentation on visual graphs and of pretraining on the GVLQA dataset.
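To make the notion of a visual graph concrete, the following is a minimal illustrative sketch, not GITA's actual rendering pipeline: it draws a graph's edge list as an image with networkx and matplotlib, and varying the layout algorithm is one simple form of the layout augmentation mentioned above. The function name and default choices here are hypothetical.

```python
# Illustrative sketch (assumed, not the paper's implementation) of rendering
# a graph's edge list into a "visual graph" image with networkx + matplotlib.
import networkx as nx
import matplotlib.pyplot as plt

def render_visual_graph(edges, path="visual_graph.png", layout="spring"):
    """Draw an undirected graph and save it as an image.

    `layout` selects the node-placement algorithm; switching layouts
    is one simple way to realize layout augmentation on visual graphs.
    """
    g = nx.Graph(edges)
    layouts = {
        "spring": nx.spring_layout,
        "circular": nx.circular_layout,
        "kamada_kawai": nx.kamada_kawai_layout,
    }
    pos = layouts[layout](g)  # compute 2D node positions
    nx.draw(g, pos, with_labels=True, node_color="lightblue", edge_color="gray")
    plt.savefig(path, dpi=150)
    plt.close()

# Example: the same 5-node cycle rendered under two different layouts,
# yielding two visually distinct images of one underlying structure.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
render_visual_graph(edges, "cycle_spring.png", layout="spring")
render_visual_graph(edges, "cycle_circular.png", layout="circular")
```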