With the rapid growth of video data on the internet, video summarization is becoming an increasingly important AI technology. However, because annotating video summaries is expensive, existing studies have been conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce dense video captions as a supervision signal for training video summarization models. Accordingly, we propose Cap2Sum, a model that learns to summarize videos by generating captions, thereby exploiting dense video caption annotations. This weakly supervised approach allows us to train models on large-scale dense video caption datasets, yielding better performance and generalization capacity. To further improve generalization, we introduce a CLIP prior mechanism (based on the strong vision-language model CLIP) that enhances the learning of important objects in the video that the captions may overlook. In practice, Cap2Sum can perform zero-shot video summarization or be fine-tuned on the ground-truth summaries or video captions of the target dataset. To evaluate Cap2Sum after weakly supervised fine-tuning on video captions, we construct two new datasets, TVSum-Caption and SumMe-Caption, which are derived from two widely used video summarization datasets and will be publicly released. Extensive experiments demonstrate that our method achieves significant improvements in performance and generalization capacity over previous methods.
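To make the weak-supervision idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes CLIP frame and caption embeddings are precomputed, replaces the actual caption decoder with a simple embedding-similarity surrogate, and uses hypothetical names (FrameScorer, caption_supervision_loss, clip_prior_loss) purely for illustration.

```python
# Illustrative sketch of caption-supervised frame scoring with a CLIP-style prior.
# NOT the Cap2Sum implementation; all names and losses here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameScorer(nn.Module):
    """Scores each frame's importance from its (precomputed) visual embedding."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, frame_emb):               # frame_emb: (T, dim)
        return self.net(frame_emb).squeeze(-1)  # (T,) importance logits

def caption_supervision_loss(scores, frame_emb, caption_emb):
    """Surrogate for 'summarize by generating captions': the score-weighted
    video representation should match the caption embedding."""
    weights = torch.softmax(scores, dim=0)               # (T,)
    pooled = (weights.unsqueeze(-1) * frame_emb).sum(0)  # weighted summary vector
    return 1 - F.cosine_similarity(pooled, caption_emb, dim=0)

def clip_prior_loss(scores, frame_emb, caption_emb):
    """CLIP-prior-style term: frames close to the caption in embedding space
    are nudged toward higher importance scores."""
    prior = F.cosine_similarity(frame_emb, caption_emb.unsqueeze(0), dim=-1)  # (T,)
    return F.mse_loss(torch.sigmoid(scores), torch.sigmoid(prior))

if __name__ == "__main__":
    T, dim = 64, 512
    frame_emb = torch.randn(T, dim)   # placeholder for precomputed CLIP frame embeddings
    caption_emb = torch.randn(dim)    # placeholder for precomputed CLIP caption embedding
    model = FrameScorer(dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    scores = model(frame_emb)
    loss = caption_supervision_loss(scores, frame_emb, caption_emb) \
         + 0.5 * clip_prior_loss(scores, frame_emb, caption_emb)
    loss.backward()
    opt.step()
```

The sketch only conveys the training signal described in the abstract: caption annotations supervise which frames matter, while the CLIP-based prior term supplements objects the captions may overlook; the real model additionally generates captions and predicts a summary.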