This paper studies multimodal named entity recognition (MNER) and multimodal relation extraction (MRE), which are important for analyzing multimedia social platforms. The core of MNER and MRE lies in incorporating evident visual information to enhance textual semantics, which raises two issues that demand investigation. The first is modality noise: task-irrelevant information in each modality may act as noise that misleads the task prediction. The second is the modality gap: representations from different modalities are inconsistent, which prevents building semantic alignment between text and image. To address these issues, we propose MMIB, a novel method for MNER and MRE based on Multi-Modal representation learning with Information Bottleneck. For the first issue, a refinement-regularizer applies the information-bottleneck principle to balance predictive evidence against noisy information, yielding expressive representations for prediction. For the second issue, an alignment-regularizer is proposed, in which a mutual-information-based term works in a contrastive manner to encourage consistent text-image representations. To the best of our knowledge, we are the first to explore variational IB estimation for MNER and MRE. Experiments show that MMIB achieves state-of-the-art performance on three public benchmarks.
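The two regularizers described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic variational information bottleneck, where compression is the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, and a generic InfoNCE contrastive loss as the mutual-information-based alignment term. The function names and the weights `beta` and `gamma` are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Stands in for a refinement-style compression penalty (assumption).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(text_vecs, image_vecs, temperature=0.1):
    """InfoNCE loss: the i-th text and i-th image form the positive pair.

    Stands in for a contrastive alignment penalty (assumption).
    """
    losses = []
    for i, text in enumerate(text_vecs):
        logits = [cosine(text, image) / temperature for image in image_vecs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(
            sum(math.exp(logit - peak) for logit in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def combined_objective(task_loss, mu, log_var, text_vecs, image_vecs,
                       beta=1e-3, gamma=1.0):
    """Task loss plus the two penalties; beta and gamma are hypothetical weights."""
    return (task_loss
            + beta * kl_to_standard_normal(mu, log_var)
            + gamma * info_nce(text_vecs, image_vecs))
```

In this sketch, a larger `beta` compresses the representation more aggressively, while a larger `gamma` pulls matched text and image vectors closer together relative to mismatched pairs.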