Multimodal abstractive summarization (MAS) aims to produce a concise summary from multimodal data (text and vision). Existing studies mainly focus on how to use visual features effectively from the perspective of the article, and have achieved impressive results on the high-resource English dataset. However, less attention has been paid to visual features from the perspective of the summary, which may limit model performance, especially in low- and zero-resource scenarios. In this paper, we propose to improve summary quality through summary-oriented visual features. To this end, we devise two auxiliary tasks: a vision-to-summary task and a masked image modeling task. We optimize the MAS model jointly on the training objectives of the main summarization task and both auxiliary tasks. In this way, the MAS model learns to capture summary-oriented visual features and thereby yields more accurate summaries. Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach, which achieves state-of-the-art performance in all scenarios. Additionally, we will contribute a large-scale multilingual multimodal abstractive summarization (MM-Sum) dataset.
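The joint optimization described above can be illustrated with a minimal sketch. The abstract does not state how the three objectives are combined, so the weighted sum below, the weights `alpha` and `beta`, and the names `TaskLosses` and `joint_objective` are all assumptions made for illustration rather than the paper's formulation.

```python
from dataclasses import dataclass


@dataclass
class TaskLosses:
    """Per-batch scalar losses for the three tasks (hypothetical container)."""
    summarization: float          # main MAS objective
    vision_to_summary: float      # auxiliary task 1
    masked_image_modeling: float  # auxiliary task 2


def joint_objective(losses: TaskLosses, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Combine the main loss with the two auxiliary losses.

    Assumption: a weighted sum with hypothetical weights alpha and beta;
    the source only says that all objectives are optimized together.
    """
    return (
        losses.summarization
        + alpha * losses.vision_to_summary
        + beta * losses.masked_image_modeling
    )


if __name__ == "__main__":
    batch = TaskLosses(summarization=2.0, vision_to_summary=1.0, masked_image_modeling=0.5)
    print(joint_objective(batch, alpha=0.5, beta=2.0))
```

Under this assumption, setting `alpha` and `beta` to zero recovers training on the summarization task alone, which gives a natural baseline for measuring what the auxiliary tasks contribute.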