This paper presents a novel approach to unsupervised video summarization using reinforcement learning. It addresses key limitations of current unsupervised methods, namely the unstable training of adversarial generator-discriminator architectures and the reliance on hand-crafted reward functions for quality evaluation. The proposed method is based on the premise that a concise and informative summary should allow reconstruction of a video that closely resembles the original. The summarizer model assigns an importance score to each frame and generates a video summary. In the proposed scheme, reinforcement learning, coupled with a unique reward generation pipeline, is employed to train the summarizer model. The reward generation pipeline trains the summarizer to create summaries that lead to improved reconstructions. It comprises a generator model capable of reconstructing masked frames from a partially masked video, along with a reward mechanism that compares the video reconstructed from the summary against the original. The video generator is trained in a self-supervised manner to reconstruct randomly masked frames, which improves the fidelity of its reconstructions and, in turn, the quality of the reward signal used to train the summarizer. This training pipeline yields a summarizer model that mimics human-generated video summaries more closely than methods relying on hand-crafted rewards. Unlike adversarial architectures, the training process consists of two stable and isolated training steps. Experimental results demonstrate promising performance, with F-scores of 62.3 and 54.5 on the TVSum and SumMe datasets, respectively. Additionally, the inference stage is 300 times faster than that of our previously reported state-of-the-art method.
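The reward generation pipeline described above can be illustrated with a minimal sketch. Everything here is a stand-in: the random importance scores take the place of the learned summarizer, and nearest-kept-frame copying takes the place of the self-supervised generator (which in the paper is a trained model). Only the overall flow — score frames, mask out non-summary frames, reconstruct them, and score the reconstruction against the original — reflects the method.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 12, 8                       # toy sizes: frames, feature dimension
video = rng.normal(size=(T, D))    # per-frame feature vectors

# Summarizer stand-in: one importance score per frame (the paper uses a
# learned model; random scores are purely illustrative).
scores = rng.uniform(size=T)
keep = scores >= np.quantile(scores, 0.7)   # keep top ~30% of frames as the summary

# Generator stand-in: fill each masked frame from the nearest kept frame.
# The paper trains a self-supervised reconstruction model for this step.
recon = video.copy()
kept_idx = np.flatnonzero(keep)
for t in np.flatnonzero(~keep):
    nearest = kept_idx[np.argmin(np.abs(kept_idx - t))]
    recon[t] = video[nearest]

# Reward mechanism: compare the reconstructed video with the original;
# a better summary enables a lower reconstruction error, hence a higher reward.
reward = -np.mean((recon - video) ** 2)
```

In the full method, this reward is fed back to the summarizer via reinforcement learning, so the summarizer is pushed toward frame selections from which the generator can rebuild the original video most faithfully.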