Mining structured knowledge from tweets with named entity recognition (NER) can benefit many downstream applications, such as recommendation and intention understanding. Because tweets are increasingly multimodal, multimodal named entity recognition (MNER) has attracted growing attention. In this paper, we propose a novel approach that dynamically aligns the image and the text sequence and performs multi-level cross-modal learning to augment textual word representations for MNER. Specifically, our framework consists of three main stages: the first performs intra-modality representation learning to derive the implicit global and local knowledge of each modality; the second evaluates the relevance between the text and its accompanying image and integrates visual information at different granularities based on that relevance; the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
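To make the three-stage flow concrete, the following is a minimal sketch in plain Python, not the model itself. Everything specific in it is an assumption made for illustration: the function names (`relevance`, `cross_attend`, `fuse`), the use of mean pooling as the "global" visual feature, a rescaled cosine similarity as the text-image relevance score, and residual dot-product attention as the cross-modal interaction. Stage one (intra-modality encoding) is taken as given, so the inputs are already word vectors and image-region vectors of equal dimension.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    # Element-wise average of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevance(words, regions):
    """Stage 2 (assumed form): cosine similarity between the pooled text
    and pooled image vectors, rescaled from [-1, 1] to [0, 1]."""
    t, g = mean(words), mean(regions)
    norm = math.sqrt(dot(t, t)) * math.sqrt(dot(g, g))
    cos = dot(t, g) / norm if norm > 0 else 0.0
    return (cos + 1.0) / 2.0

def cross_attend(words, visual):
    """One round of word-to-visual scaled dot-product attention,
    added back to each word vector as a residual."""
    scale = math.sqrt(len(words[0]))
    out = []
    for w in words:
        weights = softmax([dot(w, v) / scale for v in visual])
        context = [sum(a * v[i] for a, v in zip(weights, visual))
                   for i in range(len(w))]
        out.append([wi + ci for wi, ci in zip(w, context)])
    return out

def fuse(words, regions, iterations=2):
    """Gate visual features by relevance, then refine the word vectors
    through several rounds of cross-modal attention (stage 3)."""
    r = relevance(words, regions)
    # Coarse (pooled) and fine (per-region) features, both scaled by r.
    visual = [[r * x for x in v] for v in [mean(regions)] + regions]
    for _ in range(iterations):
        words = cross_attend(words, visual)
    return words, r

# Hypothetical toy inputs: two 2-d word vectors, two 2-d region vectors.
refined, score = fuse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In this sketch, an image whose pooled vector points away from the pooled text vector receives a relevance score near 0, so the gated visual features shrink toward zero and the word vectors pass through nearly unchanged. This is one simple way to keep an irrelevant image from distorting the textual representation.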