Multi-modal named entity recognition (MNER) aims to identify entity spans in social media posts and classify them into categories with the aid of the accompanying images. In dominant MNER approaches, however, the modalities usually interact either through alternating self-attention and cross-attention layers or through heavy reliance on a gating mechanism, which yields imprecise and biased correspondences between the fine-grained semantic units of text and image. To address this issue, we propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER. Specifically, we first use the noun phrases in a sentence, together with general domain words, to obtain visual cues. We then transform the fine-grained semantic representations of vision and text into a unified lattice structure and design a novel relative position encoding to match the different modalities within the Transformer. In addition, we leverage entity boundary detection as an auxiliary task to alleviate visual bias. Experiments show that our method achieves new state-of-the-art performance on two benchmark datasets.
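To make the idea of a unified lattice with relative positions more concrete, the following is a minimal sketch and not the FMIT implementation. The abstract does not specify the encoding, so the sketch borrows the head/tail span distances used in flat-lattice Transformers for text and assumes, purely for illustration, that each visual cue inherits the span of the noun phrase it was retrieved for. The names `Unit`, `build_flat_lattice`, and `relative_distances` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Unit:
    """One node of the flat lattice (hypothetical structure)."""
    label: str     # surface form, or a placeholder name for a visual cue
    modality: str  # "text", "phrase", or "image"
    head: int      # index of the first token the unit covers
    tail: int      # index of the last token the unit covers


def build_flat_lattice(tokens: List[str],
                       phrase_spans: List[Tuple[int, int]]) -> List[Unit]:
    """Flatten tokens, noun phrases, and visual cues into one sequence.

    Assumption: a visual cue is anchored to the same (head, tail) span
    as the noun phrase that produced it.
    """
    units = [Unit(tok, "text", i, i) for i, tok in enumerate(tokens)]
    for start, end in phrase_spans:
        phrase = " ".join(tokens[start:end + 1])
        units.append(Unit(phrase, "phrase", start, end))
        units.append(Unit(f"region<{phrase}>", "image", start, end))
    return units


def relative_distances(units: List[Unit]) -> Dict[str, List[List[int]]]:
    """Four pairwise distance matrices between unit heads and tails."""
    n = len(units)

    def matrix(a: str, b: str) -> List[List[int]]:
        # Entry [i][j] is the signed offset between unit i and unit j.
        return [[getattr(units[i], a) - getattr(units[j], b)
                 for j in range(n)] for i in range(n)]

    return {
        "hh": matrix("head", "head"),
        "ht": matrix("head", "tail"),
        "th": matrix("tail", "head"),
        "tt": matrix("tail", "tail"),
    }


if __name__ == "__main__":
    tokens = ["Kevin", "Durant", "enters", "Oracle", "Arena"]
    lattice = build_flat_lattice(tokens, [(0, 1), (3, 4)])
    dist = relative_distances(lattice)
    for unit in lattice:
        print(unit)
    print(dist["hh"][5])  # head offsets of the first phrase to every unit
```

In a full model, these integer offsets would typically be mapped to learned or sinusoidal embeddings and added to the attention scores, so that a visual cue and the tokens it grounds sit at a relative distance of zero while unrelated tokens lie farther away.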