2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion

Named entity recognition (NER) is a fundamental task in natural language processing that identifies entities in sentences and classifies them into pre-defined types. It plays a crucial role in various research fields, including entity linking, question answering, and online product recommendation. Recent studies have shown that incorporating multilingual and multimodal datasets can improve NER performance, owing to transfer learning across languages and to implicit features shared across modalities. However, no existing dataset combines multilingualism and multimodality, which has hindered research on using the two together, even though multimodal information could help NER in multiple languages simultaneously. In this paper, we address a more challenging task, multilingual and multimodal named entity recognition (MMNER), motivated by its potential value and influence. Specifically, we construct a large-scale MMNER dataset covering four languages (English, French, German, and Spanish) and two modalities (text and image). To tackle the MMNER task on this dataset, we introduce a new model, 2M-NER, which aligns text and image representations using contrastive learning and integrates a multimodal collaboration module to capture the interactions between the two modalities. Extensive experiments show that our model achieves the highest F1 score on multilingual and multimodal NER tasks among the representative baselines we compare against. Additionally, in an additional, more challenging analysis, we find that sentence-level alignment substantially interferes with NER models, which indicates the higher difficulty of our dataset.
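To make the idea of aligning text and image representations with contrastive learning more concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired embeddings. The abstract does not specify the exact objective used in 2M-NER, so the loss form, the function names (`contrastive_alignment_loss`, `cross_entropy_diag`), and the `temperature` value are all assumptions chosen for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed objective, not taken from the paper).

    text_embs[i] and image_embs[i] are treated as a matching pair; every other
    combination in the batch is treated as a negative.
    """
    text = [l2_normalize(v) for v in text_embs]
    image = [l2_normalize(v) for v in image_embs]
    # Similarity matrix: rows index texts, columns index images.
    logits = [[dot(t, im) / temperature for im in image] for t in text]
    # Transpose to score image-to-text retrieval as well.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    matched_images = [[1.0, 0.0], [0.0, 1.0]]
    swapped_images = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_alignment_loss(text, matched_images))
    print(contrastive_alignment_loss(text, swapped_images))
```

Under this sketch, a batch whose text and image embeddings point in the same direction for each pair yields a loss near zero, while a batch with swapped pairs yields a much larger loss. In a real model the embeddings would come from learned text and image encoders rather than hand-written vectors.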
