The rise of social media and the exponential growth of multimodal communication necessitate advanced techniques for Multimodal Information Extraction (MIE). However, existing methodologies primarily rely on direct Image-Text interactions, a paradigm that often faces significant challenges due to semantic and modality gaps between images and text. In this paper, we introduce a new paradigm of Image-Context-Text interaction, where large multimodal models (LMMs) are utilized to generate descriptive textual context to bridge these gaps. In line with this paradigm, we propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method, which aligns both context-text and context-image pairs. Shap-CA first applies the Shapley value concept from cooperative game theory to assess the individual contribution of each element in the set of contexts, texts, and images toward the total semantic and modality overlap. Following this quantitative evaluation, a contrastive learning strategy is employed to enhance the interactive contribution within context-text/image pairs while minimizing the influence across these pairs. Furthermore, we design an adaptive fusion module for selective cross-modal fusion. Extensive experiments on four MIE datasets demonstrate that our method significantly outperforms existing state-of-the-art methods.
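To make the alignment step concrete, the sketch below shows one plausible instantiation of Shapley-value-based contrastive alignment for the context-text side (the context-image side is symmetric). It treats the contexts in a mini-batch as players in a cooperative game, estimates each context's Shapley value toward the semantic overlap with a given text by Monte Carlo permutation sampling, and applies an InfoNCE-style loss so that the paired context holds the dominant contribution. The characteristic function (max cosine similarity), the sampling estimator, and all names (`characteristic`, `shapley_values`, `shap_ca_loss`) are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of Shapley-value-based contrastive alignment (context-text side).
# All design choices here are assumptions for illustration, not Shap-CA's exact method.
import torch
import torch.nn.functional as F

def characteristic(coalition_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    """v(S): semantic overlap between a coalition of context embeddings (k, d)
    and one target embedding (d,), taken here as the max cosine similarity.
    Returns 0 for the empty coalition."""
    if coalition_emb.numel() == 0:
        return target_emb.new_tensor(0.0)
    sims = F.cosine_similarity(coalition_emb, target_emb.unsqueeze(0), dim=-1)
    return sims.max()

def shapley_values(context_emb: torch.Tensor, target_emb: torch.Tensor,
                   n_samples: int = 64) -> torch.Tensor:
    """Monte Carlo estimate of each context's Shapley value toward the overlap
    with a single target: average marginal gain over random permutations."""
    n = context_emb.size(0)
    phi = target_emb.new_zeros(n)
    for _ in range(n_samples):
        perm = torch.randperm(n)
        prev = target_emb.new_tensor(0.0)  # v(empty coalition) = 0
        for k in range(n):
            cur = characteristic(context_emb[perm[: k + 1]], target_emb)
            phi[perm[k]] += cur - prev  # marginal contribution of player perm[k]
            prev = cur
    return phi / n_samples

def shap_ca_loss(context_emb: torch.Tensor, text_emb: torch.Tensor,
                 tau: float = 0.1) -> torch.Tensor:
    """Contrastive objective over Shapley values: for text i, the paired
    context i should contribute most, so InfoNCE uses the Shapley estimates
    as logits with the diagonal as the positive class."""
    batch = text_emb.size(0)
    logits = torch.stack([shapley_values(context_emb, text_emb[i])
                          for i in range(batch)])
    labels = torch.arange(batch)
    return F.cross_entropy(logits / tau, labels)

# Usage with embeddings from any encoder; gradients flow through the cosine terms.
ctx = F.normalize(torch.randn(8, 256, requires_grad=True), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
loss = shap_ca_loss(ctx, txt)
loss.backward()
```

The nested loop is written for readability; a practical version would amortize the O(n_samples x n) coalition evaluations, for example by reusing running maxima along each sampled permutation.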