Vision-language models (VLMs) have shown promise in graph structure understanding, but they remain limited by input-token constraints, which create scalability bottlenecks, and they lack effective mechanisms for coordinating the textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that improves both scalability and modality coordination in graph structure understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving the elements needed for reasoning. For modality coordination, GraphVista introduces a planning agent that decomposes tasks and routes each to the most suitable modality: the text modality for direct access to explicit graph properties, and the visual modality for reasoning over local graph structure grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to 200$\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to a 4.4$\times$ quality improvement over state-of-the-art baselines by exploiting the complementary strengths of both modalities.
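To make the routing idea concrete, the sketch below illustrates how a planning agent might dispatch subtasks between a text path and a visual path backed by a lightweight retrieval base. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the names `GraphRAGBase`, `route`, `PROPERTY_TASKS`, and `STRUCTURE_TASKS`, as well as the specific task taxonomy, are hypothetical.

```python
"""Minimal sketch of GraphVista-style modality routing (illustrative only).

Assumptions: property queries (degree, neighbors, edge existence) are served
from compact textual descriptions, while local structural queries (cycles,
paths, connectivity) are served from a k-hop subgraph that would be rendered
as a high-resolution image for the visual modality.
"""
from dataclasses import dataclass, field

# Hypothetical task taxonomy; the real planner would be learned or prompted.
PROPERTY_TASKS = {"degree", "neighbors", "edge_exists"}
STRUCTURE_TASKS = {"cycle", "shortest_path", "connectivity"}


@dataclass
class GraphRAGBase:
    """Lightweight store: adjacency lists queried per task instead of
    serializing the whole graph into the VLM context."""
    adj: dict = field(default_factory=dict)

    def add_edge(self, u: str, v: str) -> None:
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def retrieve_text(self, nodes: list) -> str:
        # Task-relevant textual descriptions only (compresses redundant context).
        return "; ".join(
            f"{n}: degree={len(self.adj.get(n, set()))}, "
            f"neighbors={sorted(self.adj.get(n, set()))}"
            for n in nodes
        )

    def retrieve_subgraph(self, nodes: list, hops: int = 1) -> dict:
        # k-hop ego subgraph; in the full system this would be rendered as an
        # image for the visual modality. Here we just return its adjacency.
        frontier, seen = set(nodes), set(nodes)
        for _ in range(hops):
            frontier = {m for n in frontier for m in self.adj.get(n, set())} - seen
            seen |= frontier
        return {n: self.adj.get(n, set()) & seen for n in seen}


def route(task: str, nodes: list, base: GraphRAGBase):
    """Planning-agent routing: text modality for explicit properties,
    visual modality for local structural reasoning."""
    if task in PROPERTY_TASKS:
        return "text", base.retrieve_text(nodes)
    if task in STRUCTURE_TASKS:
        return "visual", base.retrieve_subgraph(nodes)
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    g = GraphRAGBase()
    for u, v in [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]:
        g.add_edge(u, v)
    print(route("degree", ["c"], g))   # text modality: compact property string
    print(route("cycle", ["a"], g))    # visual modality: 1-hop ego subgraph
```

The design point this sketch tries to capture is that neither modality sees the full graph: each subtask receives only the retrieved slice it needs, which is what keeps the context within the VLM's token budget as graphs grow.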