The many-to-many multimodal summarization (M$^3$S) task aims to generate summaries in any language from document inputs in any language and the corresponding image sequence, and thus essentially comprises the multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although MMS and MXLS have each attracted increasing attention in recent years, little research has addressed the M$^3$S task. Moreover, existing studies mainly focus on 1) utilizing MMS to enhance MXLS via knowledge distillation without considering the performance of MMS, or 2) improving MMS models by filtering summary-unrelated visual features with implicit learning or explicit, complex training objectives. In this paper, we first introduce a general and practical task, i.e., M$^3$S. Further, we propose a dual knowledge distillation and target-oriented vision modeling framework for the M$^3$S task. Specifically, the dual knowledge distillation method ensures that the knowledge of MMS and MXLS can be transferred to each other, so that the two tasks mutually promote one another. To offer target-oriented visual features, we design a simple yet effective target-oriented contrastive objective that is responsible for discarding needless visual information. Extensive experiments in the many-to-many setting show the effectiveness of the proposed approach. Additionally, we will contribute a many-to-many multimodal summarization (M$^3$Sum) dataset.
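To make the two training signals concrete, the following is a minimal sketch and not the exact formulation used in this work. It assumes that the dual knowledge distillation can be approximated by a symmetric KL divergence between the MMS and MXLS output distributions over a shared target-language vocabulary, and that the target-oriented contrastive objective can be approximated by an InfoNCE-style loss that pulls each visual feature vector toward the representation of its own target summary and away from the other summaries in the batch. All function names, the temperature value, and the toy inputs are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dual_distillation_loss(mms_logits, mxls_logits):
    """Assumed form of dual distillation: KL in both directions, so that
    each task acts as teacher for the other."""
    p = softmax(mms_logits)
    q = softmax(mxls_logits)
    return kl_divergence(p, q) + kl_divergence(q, p)


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def target_contrastive_loss(visual, summaries, temperature=0.1):
    """Assumed InfoNCE-style objective: visual[i] is paired with
    summaries[i]; the other summaries in the batch serve as negatives."""
    losses = []
    for i, v in enumerate(visual):
        sims = [cosine(v, s) / temperature for s in summaries]
        probs = softmax(sims)
        losses.append(-math.log(probs[i] + 1e-12))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy vocabulary of size 3 and toy 2-dimensional features.
    print(dual_distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    print(target_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                  [[1.0, 0.0], [0.0, 1.0]]))
```

Under these assumptions, the distillation term vanishes when the two tasks agree on the output distribution, and the contrastive term is small when each visual feature is most similar to its own target summary. In practice both terms would be added to the usual summarization loss with tunable weights.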