Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning over such heterogeneous multimodal signals while preserving graph topology raises two key challenges: weak cross-modal consistency and heterogeneous modality preference. To address these challenges, we propose Mario, a unified framework that tackles both and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. The first is a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. The second is a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
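To make the routing idea concrete, the following is a minimal sketch of how a router might score modality configurations for a node and its neighborhood. It is not Mario's implementation: the abstract does not specify the router's architecture or the set of configurations, so the view names in `VIEWS`, the mean-pooled neighborhood context, the linear scorer, and the function names `mean_pool` and `route` are all assumptions made for illustration. The weights are fixed here, whereas a learnable router would train them.

```python
import math

# Hypothetical modality configurations; the abstract does not enumerate them.
VIEWS = ["text", "image", "text+image"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def route(node_feat, neighbor_feats, weights):
    """Pick a modality view from a node and its neighborhood.

    Assumed design: concatenate the node feature with the mean of its
    neighbors' features, score each view with one weight row, and
    normalize the scores with a softmax.
    """
    if neighbor_feats:
        pooled = mean_pool(neighbor_feats)
    else:
        pooled = [0.0] * len(node_feat)  # isolated node: empty context
    context = list(node_feat) + pooled
    logits = [sum(w * x for w, x in zip(row, context)) for row in weights]
    probs = softmax(logits)
    best = max(range(len(VIEWS)), key=probs.__getitem__)
    return VIEWS[best], probs


# Toy usage with hand-set weights (one row per view).
toy_weights = [
    [1.0, 0.0, 0.0, 0.0],  # "text" scorer
    [0.0, 0.0, 0.0, 1.0],  # "image" scorer
    [0.0, 0.0, 0.0, 0.0],  # "text+image" scorer
]
view, probs = route([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]], toy_weights)
```

The hard `max` selection is used only to report a single view. During training, a router of this kind would more plausibly use the soft probabilities, since an argmax is not differentiable.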