Video summarization aims to eliminate visual redundancy while retaining the key parts of a video to construct concise and comprehensive synopses. Most existing methods use discriminative models to predict the importance scores of video frames. However, these methods are susceptible to the annotation inconsistency that arises when different annotators subjectively label the same video. In this paper, we introduce a generative framework for video summarization that learns to generate summaries from a probability-distribution perspective, effectively reducing the interference of subjective annotation noise. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of the training data through noise prediction and generates summaries by iterative denoising. Compared with discriminative methods, our method is more resistant to subjective annotation noise, less prone to overfitting the training data, and generalizes better. Moreover, to facilitate training the DDPM with limited data, we employ an unsupervised video summarization model to perform the earlier denoising steps. Extensive experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method.
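To make the DDPM mechanism concrete, the sketch below shows the standard forward noising and reverse iterative-denoising steps applied to a 1-D sequence of frame importance scores. This is a minimal illustration of generic DDPM sampling, not the paper's actual architecture: the noise predictor `predict_noise` is a hypothetical placeholder (a dummy lambda here) standing in for the learned denoising network, and the schedule parameters are assumed, not taken from the paper.

```python
import numpy as np

# Assumed linear noise schedule over T diffusion steps (illustrative values).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Forward process: noise clean importance scores x0 to step t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def p_sample_loop(predict_noise, n_frames, rng):
    """Reverse process: start from Gaussian noise and iteratively denoise
    toward a frame-importance sequence, using the given noise predictor."""
    x = rng.standard_normal(n_frames)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)  # learned denoiser in the real method
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add sampling noise at intermediate steps (sigma_t^2 = beta_t).
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(n_frames)
        else:
            x = mean
    return x

rng = np.random.default_rng(0)
# Dummy predictor: returns zeros; a trained model would predict the noise.
scores = p_sample_loop(lambda x, t: np.zeros_like(x), 16, rng)
```

Training would minimize the error between `eps` and `predict_noise(q_sample(x0, t, eps), t)` over random steps `t`; at inference, `p_sample_loop` generates a score sequence from pure noise.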