The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, multimodal summarization explicitly leverages cross-modal information to generate more reliable, higher-quality summaries. However, existing methods fail to exploit the temporal correspondence between modalities and ignore the intrinsic correlation between different samples. To address these issues, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model that effectively aligns and attends to the multimodal input. In addition, we propose two novel contrastive losses to model inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, which achieves state-of-the-art performance on all datasets. Moreover, we collect a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed text with annotated summaries. Our code and dataset are publicly available at~\url{https://boheumd.github.io/A2Summ/}.