SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

Multimodal-attributed graphs (MAGs) couple graph topology with node semantics drawn from text, images, and other modalities. Traditional graph learning contextualizes node semantics by coupling topology with node features. This coupled design becomes problematic in MAGs, where structure-induced and modality-intrinsic semantics may contribute differently to downstream tasks. Structure-induced semantics promote relational consistency through smooth variation over the topology, whereas modality-intrinsic semantics often encode local, fine-grained distinctions that should not be uniformly smoothed or aligned. The key challenge is therefore to identify semantic roles before cross-modal fusion. To this end, we use graph-frequency variation as a prior: low-frequency components capture topology-consistent semantics, and high-frequency components preserve modality-specific semantics. Based on this intuition, we propose SMGFM, a spectral multimodal graph pretraining framework that decomposes each modality-specific node signal into graph-frequency bands and assigns band-level semantic roles before cross-modal interaction. Concretely, SMGFM constructs frequency-resolved modality tokens with scalable Chebyshev filters, estimates their coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. Its frequency-routed objectives align smooth consensus routes while preserving modality-specific routes, mitigating both spatial-domain entanglement and uniform cross-modal alignment. Extensive experiments on MAG datasets demonstrate that SMGFM achieves state-of-the-art performance across graph-level and modality-level tasks.

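The abstract does not specify the filter design, so the following is only a minimal sketch of how one modality's node signal could be split into graph-frequency bands with Chebyshev polynomials. Everything in it is an assumption for illustration: the function names, the order-2 truncation, the three band responses (a Bernstein-style partition of unity on the spectrum of the symmetric normalized Laplacian), and the use of 2 as the spectral upper bound. It is not SMGFM's actual implementation.

```python
import numpy as np

def normalized_laplacian(adj):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def chebyshev_basis(lap, x, order):
    """Return [T_0(L~)x, ..., T_order(L~)x] with L~ = L - I.

    Assumes the spectrum of L lies in [0, 2], so L~ has spectrum in [-1, 1].
    Only matrix-vector products are used; no eigendecomposition is needed.
    """
    lap_scaled = lap - np.eye(lap.shape[0])
    terms = [x, lap_scaled @ x]
    for _ in range(2, order + 1):
        # Chebyshev recurrence: T_k = 2 * L~ * T_{k-1} - T_{k-2}
        terms.append(2 * lap_scaled @ terms[-1] - terms[-2])
    return terms[: order + 1]

# Hypothetical band responses on eigenvalue lam in [0, 2], with s = lam / 2:
#   low  = (1 - s)^2,  mid = 2 s (1 - s),  high = s^2   (they sum to 1).
# Expressed as coefficients of T_0, T_1, T_2 in the variable lam - 1.
BAND_COEFFS = {
    "low": (3 / 8, -1 / 2, 1 / 8),
    "mid": (1 / 4, 0.0, -1 / 4),
    "high": (3 / 8, 1 / 2, 1 / 8),
}

def frequency_bands(adj, x):
    """Split node signal x (num_nodes x dim) into low/mid/high bands."""
    basis = chebyshev_basis(normalized_laplacian(adj), x, order=2)
    return {
        name: sum(c * t for c, t in zip(coeffs, basis))
        for name, coeffs in BAND_COEFFS.items()
    }

if __name__ == "__main__":
    # Toy ring graph with 6 nodes and a random 4-dimensional "modality" signal.
    n = 6
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
    x = np.random.default_rng(0).normal(size=(n, 4))
    bands = frequency_bands(adj, x)
    print({name: band.shape for name, band in bands.items()})
```

Because the three assumed responses sum to one, the bands add back up to the original signal, so the split loses nothing and each band could then serve as a separate frequency-resolved token per modality. A learned or higher-order filter bank would replace `BAND_COEFFS` while keeping the same recurrence.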