Large Language Models (LLMs) are increasingly used for tasks that involve graph structures, such as robotic planning, knowledge graph completion, and common-sense reasoning. Although LLMs can comprehend graph information in textual form, they overlook the rich visual modality, which is an intuitive way for humans to comprehend structural information and conduct graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., visual graphs) remain unexplored. In this paper, we take the first step in incorporating visual information into graph reasoning tasks and propose a new benchmark, GITQA, in which each sample is a tuple (graph, image, textual description). We conduct extensive experiments on the GITQA benchmark using state-of-the-art multimodal LLMs. Results on graph reasoning tasks show that combining textual and visual information performs better than using either modality alone. Moreover, the LLaVA-7B/13B models finetuned on the training set (referred to as GITA) achieve higher accuracy than the closed-source model GPT-4(V). We also study the effects of augmentations in graph reasoning.