Summarizing a video requires a broad understanding of its content, from recognizing scenes to judging how essential each frame is to the summary. Self-supervised learning (SSL) is valued for its robustness and its flexibility across downstream tasks, but video SSL has not yet shown its value for dense understanding tasks such as video summarization. We claim that an unsupervised autoencoder trained with sufficient self-supervised learning can serve as a video summarization model without any additional downstream architecture design or weight fine-tuning. The proposed method derives the importance score of each frame from the reconstruction score of the autoencoder's decoder. We evaluate the method on major unsupervised video summarization benchmarks to show its effectiveness under various experimental settings.
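The idea of scoring frames by reconstruction can be illustrated with a minimal sketch. This is not the paper's implementation: a rank-limited linear autoencoder (PCA) stands in for the pretrained video autoencoder, the frame features are synthetic, and the choice to treat a higher reconstruction error as higher importance is an assumption made only for illustration. The names `make_pca_autoencoder`, `reconstruction_scores`, and `select_summary` are hypothetical.

```python
import numpy as np


def make_pca_autoencoder(train_features, rank):
    """Hypothetical stand-in for a pretrained autoencoder (assumption):
    a linear encoder/decoder built from the top `rank` principal directions."""
    mean = train_features.mean(axis=0)
    _, _, vt = np.linalg.svd(train_features - mean, full_matrices=False)
    basis = vt[:rank]  # shape: (rank, feature_dim)

    def reconstruct(x):
        codes = (x - mean) @ basis.T   # encode
        return codes @ basis + mean    # decode

    return reconstruct


def reconstruction_scores(features, reconstruct):
    """Per-frame mean squared reconstruction error, min-max scaled to [0, 1].
    Using the error directly as the importance score is an assumption."""
    error = np.mean((features - reconstruct(features)) ** 2, axis=1)
    span = error.max() - error.min()
    if span == 0:
        return np.zeros_like(error)
    return (error - error.min()) / span


def select_summary(scores, budget):
    """Indices of the `budget` highest-scoring frames, in temporal order."""
    top = np.argsort(scores)[::-1][:budget]
    return np.sort(top)


# Synthetic "pretraining" data: frame features lying on a single direction.
direction = np.ones(8)
train = np.outer(np.linspace(-1.0, 1.0, 20), direction)
reconstruct = make_pca_autoencoder(train, rank=1)

# A new 12-frame video, scored without any fine-tuning of the autoencoder.
video = np.outer(np.linspace(0.0, 1.0, 12), direction)
video[7] += np.array([1.0, -1.0, 0, 0, 0, 0, 0, 0])  # frame 7 leaves the subspace

scores = reconstruction_scores(video, reconstruct)
summary = select_summary(scores, budget=3)
print(scores.round(3))
print(summary)
```

In this toy setup, frame 7 is the only frame the autoencoder cannot reconstruct, so it receives the highest score and is always part of the selected summary. A real pipeline would replace the PCA stand-in with the pretrained video autoencoder and would typically convert frame scores into a summary under a length budget defined by the benchmark.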