Video summarization has recently been proposed as a way to support video exploration. However, traditional video summarization models generate only a fixed summary, which is usually independent of user-specific needs and therefore limits the effectiveness of video exploration. Multi-modal video summarization is one approach to addressing this issue. It takes both a video and a text-based query as input, so effectively modeling the interaction between the two is essential. In this work, a new causality-based method named Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and the query for the task of multi-modal video summarization. The proposed method consists of a probabilistic encoder and a probabilistic decoder. Evaluated on the existing multi-modal video summarization dataset, the proposed approach improves accuracy by 5.4% and F1-score by 4.92% over the state-of-the-art method.