Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support both graph-centric tasks, which require structural and class-discriminative representations, and modality-centric tasks, which require fine-grained cross-modal correspondence. However, existing MAG methods often rely on fixed graph contexts or uniformly fused representations. The resulting task-agnostic propagation and over-compressed fusion fail to meet diverse task requirements and to preserve modality-specific evidence. To address this, we propose CoMAG, a unified MAG backbone that learns task-adaptive reliable contexts and performs modality-preserving alignment within them. CoMAG first conducts Reliable Context Learning: it estimates edge reliability from multimodal semantic consistency, complements the raw topology with semantic neighbors, and selects context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment: it maintains modality-specific multi-hop trajectories, matches modality-hop tokens across modalities, and decouples shared and private representations. As a result, CoMAG produces graph and modality representations in a single forward pass while retaining modality-specific cues. We further analyze stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation. CoMAG achieves the best reported performance, indicating that task-adaptive reliable contexts and modality-preserving alignment improve structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse, edge-linear complexity.
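
To make the Reliable Context Learning step more concrete, the following is a minimal sketch of its three ingredients, not the CoMAG implementation. It assumes that semantic consistency is the mean cosine similarity of two nodes across modalities, that reliability is that score rescaled to [0, 1], and that the task-aware gate is a fixed pair of weights rather than a learned module. All function names (`consistency`, `edge_reliability`, `semantic_neighbors`, `gated_context`) and the toy features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def consistency(feats, i, j):
    """Assumed consistency score: mean cosine similarity over all modalities."""
    return sum(cosine(x[i], x[j]) for x in feats.values()) / len(feats)

def edge_reliability(edges, feats):
    """Score each raw edge in [0, 1] by rescaling the consistency score."""
    return {(i, j): 0.5 * (1.0 + consistency(feats, i, j)) for i, j in edges}

def semantic_neighbors(feats, k):
    """For each node, its k most consistent other nodes.
    Brute-force O(n^2) search for illustration only; it is not edge-linear."""
    n = len(next(iter(feats.values())))
    out = {}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: consistency(feats, i, j), reverse=True)
        out[i] = ranked[:k]
    return out

def gated_context(edges, feats, gate, k=1):
    """Blend reliability-weighted raw edges with semantic-neighbor edges.
    gate = (g_topo, g_sem) stands in for a task-aware gate; here it is a
    fixed, hand-chosen pair of weights (an assumption of this sketch)."""
    g_topo, g_sem = gate
    ctx = {e: g_topo * r for e, r in edge_reliability(edges, feats).items()}
    for i, nbrs in semantic_neighbors(feats, k).items():
        for j in nbrs:
            w = 0.5 * (1.0 + consistency(feats, i, j))
            ctx[(i, j)] = ctx.get((i, j), 0.0) + g_sem * w
    return ctx

# Toy graph: nodes 0 and 1 agree in both modalities, node 2 differs.
feats = {
    "text":  [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "image": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
}
edges = [(0, 1), (1, 2)]

rel = edge_reliability(edges, feats)
nbrs = semantic_neighbors(feats, k=1)
ctx_topology_only = gated_context(edges, feats, gate=(1.0, 0.0))
ctx_semantic_only = gated_context(edges, feats, gate=(0.0, 1.0))
```

In this toy setup the edge (0, 1) receives a higher reliability than (1, 2), and changing the gate shifts weight between the raw topology and the semantic neighbors. A trained model would produce the gate from a task signal instead of fixing it by hand.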
