Video summarization aims to eliminate visual redundancy while retaining the key parts of a video, constructing a synopsis that is both concise and comprehensive. Most existing methods use discriminative models to predict the importance scores of video frames. However, these methods are susceptible to annotation inconsistency caused by the inherent subjectivity of different annotators labeling the same video. In this paper, we introduce a generative framework for video summarization that learns to generate summaries from a probability-distribution perspective, effectively reducing the interference of subjective annotation noise. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of the training data through noise prediction and generates summaries by iterative denoising. Our method is more robust to subjective annotation noise and less prone to overfitting the training data than discriminative methods, giving it strong generalization ability. Moreover, to facilitate training a DDPM with limited data, we employ an unsupervised video summarization model to implement the earlier denoising steps. Extensive experiments on several datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method.
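To make the mechanism concrete, below is a minimal sketch of the DDPM forward-noising and reverse-denoising steps described above, applied to a vector of frame-importance scores. All names, the linear noise schedule, the number of steps, and the toy dimensionality are illustrative assumptions; the paper's actual noise-prediction network, conditioning on video features, and the unsupervised model used for the earlier denoising steps are not shown.

```python
import numpy as np

# Hypothetical setup: T diffusion steps with a linear beta schedule
# (these hyperparameters are assumptions, not the paper's values).
T = 100
betas = np.linspace(1e-4, 0.02, T)   # per-step noise variances
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal-retention factors

def q_sample(x0, t, eps):
    """Forward process: corrupt clean importance scores x0 to step t
    with Gaussian noise eps (the closed-form DDPM forward marginal)."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def p_step(xt, t, eps_pred, z):
    """One reverse (denoising) step: move x_t toward x_{t-1} using the
    noise eps_pred predicted by the (here omitted) trained network;
    z is fresh Gaussian noise, zero at the final step."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_pred) / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t]) if t > 0 else 0.0
    return mean + sigma * z
```

During training, the network is optimized to predict `eps` from `q_sample(x0, t, eps)`; at inference, summaries are produced by applying `p_step` iteratively from pure noise down to `t = 0`, with the earlier steps handled by the unsupervised summarizer as described above.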