Video summarization remains a major challenge in computer vision because of the size of the input videos to be summarized. We propose an efficient, language-only video summarizer that achieves competitive accuracy with high data efficiency. Using only textual captions obtained via a zero-shot approach, we train a language transformer model and forgo image representations. This lets us filter the representative text vectors and condense the input sequence. The approach also provides explainability: the natural-language captions are easy for humans to interpret and yield textual summaries of the videos. An ablation study focused on modality and data compression shows that using the text modality alone reduces the amount of input data to be processed while retaining comparable results.
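The abstract does not specify how the representative text vectors are filtered, so the following is only a minimal sketch of one plausible way to condense a sequence of caption embeddings. It assumes each frame caption has already been embedded as a vector, and it drops any vector whose cosine similarity to the most recently kept vector reaches a threshold. The function names `cosine` and `condense_captions` and the `threshold` parameter are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def condense_captions(vectors, threshold=0.9):
    """Return indices of caption vectors kept after redundancy filtering.

    Assumption (not from the paper): a caption is redundant when its
    similarity to the last kept caption is at or above `threshold`.
    """
    kept = []
    for i, vec in enumerate(vectors):
        # Always keep the first vector; afterwards keep only sufficiently new ones.
        if not kept or cosine(vec, vectors[kept[-1]]) < threshold:
            kept.append(i)
    return kept


# Toy embeddings: frames 0-1 and 2-3 are near-duplicates, frame 4 differs from frame 2.
toy_vectors = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 0.0]]
kept_indices = condense_captions(toy_vectors)
```

Because the comparison is made only against the last kept vector, a scene that reappears later in the video is retained rather than merged with its earlier occurrence, which preserves temporal order in the condensed sequence.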