Multi-modal keyphrase generation aims to produce a set of keyphrases that capture the core points of an input text-image pair. Dominant methods focus mainly on multi-modal fusion, but they have two main drawbacks: 1) only a limited number of sources, such as image captions, are used to provide auxiliary information, and these may not be sufficient for the subsequent keyphrase generation; 2) the input text and image are often not perfectly matched, so the image may introduce noise into the model. To address these limitations, we propose a novel multi-modal keyphrase generation model that both enriches the model input with external knowledge and effectively filters image noise. First, we introduce external visual entities of the image as supplementary input to the model, which benefits cross-modal semantic alignment for keyphrase generation. Second, we simultaneously calculate an image-text matching score and image region-text correlation scores to perform multi-granularity image noise filtering. In particular, we introduce correlation scores between image regions and ground-truth keyphrases to refine the calculation of the image region-text correlation scores. To demonstrate the effectiveness of our model, we conduct several groups of experiments on the benchmark dataset. Experimental results and in-depth analyses show that our model achieves state-of-the-art performance. Our code is available at https://github.com/DeepLearnXMU/MM-MKP.
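The multi-granularity filtering idea can be illustrated with a minimal sketch. It is not the implementation in the released code: the function names (`cosine`, `sigmoid`, `filter_image_noise`), the use of cosine similarity with a sigmoid, and the `scale` parameter are all assumptions made for illustration, and the refinement with ground-truth keyphrases is not modeled. The sketch computes one coarse score for the whole image and one fine score per image region, then uses both to down-weight region features.

```python
import numpy as np


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def filter_image_noise(text_vec, image_vec, region_vecs, scale=5.0):
    """Hypothetical two-level gate over image region features.

    text_vec:    (d,) pooled text representation
    image_vec:   (d,) pooled image representation
    region_vecs: (n, d) one feature vector per image region
    scale:       assumed sharpness factor, not taken from the paper
    """
    # Coarse granularity: a single image-text matching score.
    match_score = sigmoid(scale * cosine(text_vec, image_vec))

    # Fine granularity: one correlation score per image region.
    corr_scores = np.array(
        [sigmoid(scale * cosine(text_vec, r)) for r in region_vecs]
    )

    # Regions are scaled by both scores, so a mismatched image or an
    # irrelevant region contributes less to later fusion.
    gated_regions = match_score * corr_scores[:, None] * region_vecs
    return match_score, corr_scores, gated_regions


# Toy usage: the first region points in the same direction as the text,
# the second is orthogonal to it.
text = np.array([1.0, 0.0])
image = np.array([1.0, 0.0])
regions = np.array([[1.0, 0.0], [0.0, 1.0]])
match, corr, gated = filter_image_noise(text, image, regions)
```

In this toy input the aligned region receives a higher correlation score than the orthogonal one, so its features survive the gate with more weight.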