Manually labelled datasets for query-based video summarization are costly to build and therefore small, which limits the performance of supervised deep video summarization models. Self-supervision can address this data sparsity challenge: a pretext task, together with a method for acquiring extra data with pseudo labels, is used to pre-train a supervised deep model. In this work, we introduce segment-level pseudo labels derived from input videos to model both the relationship between the pretext task and the target task and the implicit relationship between the pseudo labels and the human-defined labels. The pseudo labels are generated from existing human-defined frame-level labels. To create more accurate query-dependent video summaries, we propose a semantics booster that generates context-aware query representations. Furthermore, we propose a mutual attention mechanism to capture the interactive information between the visual and textual modalities. We validate the proposed approach on three commonly used video summarization benchmarks. Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
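As a rough illustration of how segment-level pseudo labels might be derived from frame-level labels, the sketch below averages frame scores over fixed-length segments and thresholds the mean. The abstract does not specify the aggregation rule, so the function name `segment_pseudo_labels`, the fixed segment length, and the 0.5 threshold are all assumptions made for this example, not the authors' method.

```python
from typing import List


def segment_pseudo_labels(frame_scores: List[float], segment_len: int,
                          threshold: float = 0.5) -> List[int]:
    """Collapse frame-level scores into one binary label per segment.

    Assumption: a segment is labelled 1 when the mean of its frame
    scores reaches `threshold`. This rule is illustrative only.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    labels = []
    # Walk the video in non-overlapping windows; the last one may be shorter.
    for start in range(0, len(frame_scores), segment_len):
        segment = frame_scores[start:start + segment_len]
        mean_score = sum(segment) / len(segment)
        labels.append(1 if mean_score >= threshold else 0)
    return labels


# Hypothetical frame-level labels for a 10-frame video.
frame_scores = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1]
print(segment_pseudo_labels(frame_scores, segment_len=4))  # prints [1, 0, 1]
```

Averaging followed by thresholding is only one plausible choice; a majority vote or a max over the segment would fit the same interface.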