The advancement of multi-modal pre-training highlights the need for a robust Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a framework is crucial for integrating structured knowledge into multi-modal Large Language Models (LLMs) at scale, with the aim of alleviating issues such as knowledge misconceptions and multi-modal hallucinations. In this work, to evaluate models' ability to accurately embed entities within MMKGs, we focus on two widely researched tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose SNAG, a novel method that uses a Transformer-based architecture equipped with modality-level noise masking to robustly integrate multi-modal entity features in knowledge graphs (KGs). By incorporating task-specific training objectives for both MKGC and MMEA, our approach achieves state-of-the-art (SOTA) performance across ten datasets (three for MKGC and seven for MMEA), demonstrating its robustness and versatility. Moreover, SNAG can function not only as a standalone model but also as an enhancement to other existing methods, providing stable performance improvements. Our code and data are available at: https://github.com/zjukg/SNAG.
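To make the idea of modality-level noise masking concrete, the following is a minimal illustrative sketch and not the SNAG implementation. It assumes that each entity carries one feature vector per modality and that, during training, each modality vector is blended with Gaussian noise with some probability. The function name `mask_modalities`, the parameters `mask_prob` and `noise_ratio`, and the modality names are hypothetical and chosen only for illustration; the actual formulation is the one in the paper and the released code.

```python
import random


def mask_modalities(features, mask_prob=0.2, noise_ratio=0.5, rng=None):
    """Return a noisy copy of an entity's per-modality features.

    features:    dict mapping a modality name to a list of floats.
    mask_prob:   probability that a given modality is perturbed (assumed).
    noise_ratio: weight of the Gaussian noise in the blend (assumed).
    rng:         optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    noisy = {}
    for modality, vec in features.items():
        if rng.random() < mask_prob:
            # Blend the original vector with standard Gaussian noise.
            noisy[modality] = [
                (1.0 - noise_ratio) * x + noise_ratio * rng.gauss(0.0, 1.0)
                for x in vec
            ]
        else:
            # Leave this modality untouched, but copy it so the input
            # is never modified in place.
            noisy[modality] = list(vec)
    return noisy


# Hypothetical entity with three modalities (toy 3-dimensional features).
entity = {
    "structure": [0.1, 0.4, -0.2],
    "image": [0.9, -0.3, 0.5],
    "text": [0.0, 0.7, 0.2],
}
noisy_entity = mask_modalities(entity, mask_prob=0.5, rng=random.Random(0))
```

In a setup like this, the perturbed features would be fed to the fusion model only during training, so that the model learns not to over-rely on any single modality; at evaluation time the clean features would be used.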