SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

Multimodal-attributed graphs (MAGs) couple graph topology with node semantics drawn from text, images, and other modalities. Traditional graph learning contextualizes node semantics by coupling topology with node features. This coupled design is problematic for MAGs, where structure-induced and modality-intrinsic semantics may contribute differently to downstream tasks. Structure-induced semantics promote relational consistency through smooth variation over the topology, whereas modality-intrinsic semantics often encode local, fine-grained distinctions that should not be uniformly smoothed or aligned. The key challenge is therefore to identify semantic roles before cross-modal fusion. To this end, we use graph-frequency variation as a prior: low-frequency components capture topology-consistent semantics, and high-frequency components preserve modality-specific semantics. Based on this intuition, we propose SMGFM, a spectral multimodal graph pretraining framework that decomposes each modality-specific node signal into graph-frequency bands and assigns band-level semantic roles before cross-modal interaction. Concretely, SMGFM constructs frequency-resolved modality tokens with scalable Chebyshev filters, estimates their coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. Its frequency-routed objectives align smooth consensus routes while preserving modality-specific routes, which mitigates both spatial-domain entanglement and uniform cross-modal alignment. Extensive experiments on MAG datasets demonstrate that SMGFM achieves state-of-the-art performance across graph-level and modality-level tasks.
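
To make the frequency-band decomposition concrete, the following is a minimal sketch of how one modality's node signal could be split into a low-frequency and a high-frequency band with Chebyshev polynomial filters. It is an illustration under stated assumptions, not the SMGFM implementation: the function names (`scaled_laplacian`, `chebyshev_basis`, `split_bands`), the approximation lambda_max = 2 for the symmetric normalized Laplacian, and the fixed first-order filter coefficients are all hypothetical choices made here. The abstract does not specify the number of bands, the polynomial order, or whether the coefficients are learned.

```python
import numpy as np


def scaled_laplacian(adj):
    """Rescaled Laplacian L_tilde = 2 L / lambda_max - I, with L = I - D^{-1/2} A D^{-1/2}.

    Assumption: lambda_max is approximated by 2, so L_tilde = L - I and its
    eigenvalues lie in [-1, 1] (-1 is the smoothest graph frequency).
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    identity = np.eye(adj.shape[0])
    laplacian = identity - norm_adj
    return laplacian - identity


def chebyshev_basis(adj, x, order):
    """Stack T_0(L_tilde) x, ..., T_order(L_tilde) x via the three-term recurrence.

    Only matrix-vector products with L_tilde are needed, so no
    eigendecomposition is computed.
    """
    l_tilde = scaled_laplacian(adj)
    x = np.asarray(x, dtype=float)
    terms = [x]
    if order >= 1:
        terms.append(l_tilde @ x)
    for _ in range(2, order + 1):
        terms.append(2.0 * (l_tilde @ terms[-1]) - terms[-2])
    return np.stack(terms)  # shape: (order + 1, num_nodes, feat_dim)


def split_bands(adj, x, coeffs):
    """Apply one Chebyshev filter per band; coeffs has shape (num_bands, order + 1)."""
    coeffs = np.asarray(coeffs, dtype=float)
    basis = chebyshev_basis(adj, x, order=coeffs.shape[1] - 1)
    return np.einsum("bk,knd->bnd", coeffs, basis)


# Hypothetical fixed first-order filters (a real model would likely learn these):
#   low-pass  response (1 - lam) / 2  ->  0.5 * T_0 - 0.5 * T_1
#   high-pass response (1 + lam) / 2  ->  0.5 * T_0 + 0.5 * T_1
LOW_HIGH_COEFFS = np.array([[0.5, -0.5],
                            [0.5, 0.5]])

# Toy 4-node cycle graph with two 1-D signals: one constant, one alternating.
cycle_adj = np.array([[0, 1, 0, 1],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 0, 1, 0]])
smooth_signal = np.ones((4, 1))
rough_signal = np.array([[1.0], [-1.0], [1.0], [-1.0]])

smooth_low, smooth_high = split_bands(cycle_adj, smooth_signal, LOW_HIGH_COEFFS)
rough_low, rough_high = split_bands(cycle_adj, rough_signal, LOW_HIGH_COEFFS)
```

On this toy cycle graph, the constant signal falls entirely in the low band and the alternating signal entirely in the high band, and the two bands always sum back to the input. This mirrors the prior described above: the low band carries topology-consistent content, while the high band retains node-level distinctions that smoothing would otherwise erase.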
