Despite its wide range of applications, video summarization is still held back by the scarcity of large datasets, largely because frame-level annotation is labor-intensive and costly. As a result, existing video summarization methods are prone to overfitting. To mitigate this challenge, we propose a novel self-supervised video representation learning method that uses knowledge distillation to pre-train a transformer encoder. Our method matches the encoder's semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification. Empirical evaluations on correlation-based metrics, such as Kendall's $\tau$ and Spearman's $\rho$, show that our approach outperforms existing state-of-the-art methods in assigning relative scores to the input frames.
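To make the correlation-based evaluation concrete, the following is a minimal sketch of how Kendall's $\tau$ and Spearman's $\rho$ can be computed between predicted and annotated frame importance scores. It is an illustration under stated assumptions, not the evaluation code behind the reported results: the function names and the example scores are hypothetical, and the $\tau$ shown is the simple tau-a variant, since the abstract does not say which variant is used.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four frames (illustrative values only).
predicted = [0.9, 0.1, 0.5, 0.7]
annotated = [0.8, 0.2, 0.6, 0.4]

print(f"Spearman rho: {spearman_rho(predicted, annotated):.3f}")
print(f"Kendall tau:  {kendall_tau(predicted, annotated):.3f}")
```

Both metrics depend only on the relative ordering of the frames, not on the absolute score values, which is why they suit a task framed as assigning relative scores to input frames.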