Multimodal summarization aims to extract the most important information from multiple modalities to form a summary. Unlike unimodal summarization, it explicitly leverages cross-modal information to generate more reliable, higher-quality summaries. However, existing methods fail to exploit the temporal correspondence between modalities and ignore the intrinsic correlation between different samples. To address these issues, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model that effectively aligns and attends to the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) show that A2Summ achieves state-of-the-art performance on all four datasets. Moreover, we collect a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed text with annotated summaries. Our code and dataset are publicly available at~\url{https://boheumd.github.io/A2Summ/}.
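The abstract does not spell out the two contrastive losses, so the following is only a minimal sketch of what an inter-sample objective could look like, assuming one pooled video embedding and one pooled text embedding per sample. It uses a generic symmetric InfoNCE-style loss rather than the A2Summ formulation; the function names, the `temperature` value, and the pooled-embedding setup are all hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def inter_sample_contrastive_loss(video_embs, text_embs, temperature=0.1):
    """Generic symmetric InfoNCE-style loss over a batch (illustrative only).

    The video and text embeddings of the same sample form the positive pair;
    embeddings from the other samples in the batch act as negatives.
    """
    v = [_normalize(x) for x in video_embs]
    t = [_normalize(x) for x in text_embs]
    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print(inter_sample_contrastive_loss(video, matched_text))
    print(inter_sample_contrastive_loss(video, shuffled_text))
```

With correctly paired embeddings the loss is close to zero, while shuffling the text embeddings across samples makes it large. That gap is the signal an inter-sample objective of this kind relies on.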