Multimodal abstractive summarization for videos (MAS) requires generating a concise textual summary that describes the highlights of a video from multimodal resources, in our case the video content and its transcript. Inspired by the success of large-scale generative pre-trained language models (GPLMs) in generating high-quality text (e.g., summaries), recent MAS methods adapt a GPLM to this task by equipping it with visual information, which is often obtained through a general-purpose visual feature extractor. However, such generically extracted visual features may overlook summary-worthy visual information, which impedes model performance. In this work, we propose a novel approach to learning summary-worthy visual representations that facilitate abstractive summarization. Our method exploits summary-worthy information from both the cross-modal transcript data and knowledge distilled from a pseudo summary. Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, by leveraging summary-worthy visual information, our model achieves significant improvements on small datasets and even in settings with limited training data.