Mining structured knowledge from tweets with named entity recognition (NER) benefits many downstream applications, such as recommendation and intention understanding. Because tweets are increasingly multimodal, multimodal named entity recognition (MNER) has attracted growing attention. In this paper, we propose a novel approach that dynamically aligns the image with the text sequence and performs multi-level cross-modal learning to augment textual word representations for MNER. Specifically, our framework consists of three main stages: the first performs intra-modality representation learning to derive the implicit global and local knowledge of each modality; the second evaluates the relevance between the text and its accompanying image and integrates visual information at different granularities based on that relevance; the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
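To make the second stage more concrete, the following is a minimal sketch of relevance-gated visual integration, not the authors' implementation. The abstract does not give the scoring or fusion formulas, so everything here is an assumption for illustration: the function name `relevance_gated_fusion`, the use of rescaled cosine similarity as the text-image relevance score, scaled dot-product attention over image regions as the local (fine-grained) visual signal, and additive fusion with a global image vector as the coarse-grained signal.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relevance_gated_fusion(words, regions, global_img):
    """Hypothetical stage-two fusion (illustrative assumption only).

    words:      (T, d) word representations from the text encoder
    regions:    (R, d) local image-region features
    global_img: (d,)   global image feature
    """
    d = words.shape[1]

    # Assumed relevance score: cosine similarity between the mean-pooled
    # text vector and the global image vector, rescaled from [-1, 1] to [0, 1].
    text_global = words.mean(axis=0)
    denom = np.linalg.norm(text_global) * np.linalg.norm(global_img) + 1e-8
    relevance = 0.5 * (text_global @ global_img / denom + 1.0)

    # Fine-grained visual context: each word attends over the image regions.
    attn = softmax(words @ regions.T / np.sqrt(d), axis=-1)  # (T, R)
    local_visual = attn @ regions                             # (T, d)

    # Combine fine-grained and coarse-grained visual information, then
    # scale by the relevance score so an unrelated image contributes little.
    visual = local_visual + global_img                        # broadcast over T
    fused = words + relevance * visual
    return fused, relevance, attn


# Usage with random features standing in for real encoder outputs.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))
regions = rng.normal(size=(9, 16))
global_img = rng.normal(size=16)
fused, relevance, attn = relevance_gated_fusion(words, regions, global_img)
```

The point of the gate is that the word representations pass through almost unchanged when the image is judged irrelevant, which is one plausible reading of "integrates different grained visual information based on the relevance"; a learned gate would normally replace the fixed cosine score used here.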