Multimodal summarization often suffers from an unclear contribution of the visual modality. Existing approaches focus on designing fusion methods for the different modalities while ignoring the conditions under which the visual modality is actually useful. We therefore propose a novel Coarse-to-Fine contribution network for multimodal Summarization (CFSum) that accounts for the differing contributions of images to summarization. First, to eliminate interference from useless images, we propose a pre-filter module that discards them. Second, to make accurate use of useful images, we propose visual complement modules at two levels: word level and phrase level. Specifically, image contributions are calculated and used to guide the attention of both the textual and visual modalities. Experimental results show that CFSum significantly outperforms multiple strong baselines on the standard benchmark. Furthermore, our analysis verifies that useful images can even help generate non-visual words that are implicitly represented in the image.
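The coarse-to-fine idea described above can be illustrated with a minimal sketch. It is not the CFSum implementation: the abstract does not say how relevance or contributions are computed, so the cosine-similarity pre-filter, the sigmoid contribution score, the `threshold` value, and every function name below are assumptions made only for illustration. Only the word level is sketched; the phrase level is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prefilter(text_vec, image_vec, threshold=0.3):
    """Coarse step (assumed form): keep the image only if it resembles the text."""
    return cosine(text_vec, image_vec) >= threshold


def word_level_contributions(word_vecs, image_vec):
    """Fine step (assumed form): one contribution score in (0, 1) per word."""
    return [
        sigmoid(sum(a * b for a, b in zip(w, image_vec)))
        for w in word_vecs
    ]


def fuse(word_vecs, image_vec, threshold=0.3):
    """Add contribution-weighted visual features to each word vector.

    If the pre-filter rejects the image, the text is returned unchanged.
    """
    # Mean-pool the word vectors to get a single text representation.
    text_vec = [sum(col) / len(word_vecs) for col in zip(*word_vecs)]
    if not prefilter(text_vec, image_vec, threshold):
        return [list(w) for w in word_vecs]
    scores = word_level_contributions(word_vecs, image_vec)
    return [
        [a + s * b for a, b in zip(w, image_vec)]
        for w, s in zip(word_vecs, scores)
    ]


words = [[1.0, 0.0], [0.0, 1.0]]
fused_relevant = fuse(words, [1.0, 1.0])      # image passes the pre-filter
fused_irrelevant = fuse(words, [-1.0, -1.0])  # image is discarded
```

In this toy setup the second image points away from the pooled text vector, so it is filtered out and the word vectors pass through untouched, while the first image is mixed into each word in proportion to its per-word score.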