Vision-Language Models (VLMs) can process increasingly long videos. Yet important visual information is easily lost within the long context and missed by VLMs. It is also important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These descriptions are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain a low computational cost by relying on a lightweight captioning model.
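To make the described pipeline concrete, the following is a minimal sketch of the caption-then-select step, under stated assumptions: `caption_fn` and `llm_fn` are hypothetical stand-ins for the lightweight video captioning model and the LLM, the prompt wording is illustrative only, and robust parsing of the LLM response is omitted. It is not the exact implementation used in the paper.

```python
# Minimal sketch of the clip-selection pipeline: caption each short clip with a
# lightweight captioner, then ask an LLM to pick the K most informative clips.
# `caption_fn` and `llm_fn` are placeholder callables (assumptions), not the
# authors' actual components.

from typing import Callable, List
import json


def select_key_clips(
    clip_paths: List[str],
    caption_fn: Callable[[str], str],  # lightweight captioner: clip path -> short description
    llm_fn: Callable[[str], str],      # LLM call: prompt -> raw text response
    k: int = 10,
) -> List[int]:
    """Return the indices of the K clips the LLM judges most relevant
    for a multimodal summary of the whole video."""
    # 1) Generate a compact visual description per clip.
    captions = [caption_fn(path) for path in clip_paths]

    # 2) Build a single prompt listing all captions with their indices.
    listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    prompt = (
        "Below are short descriptions of consecutive video clips.\n"
        f"{listing}\n\n"
        f"Select the {k} clips whose visual content is most important for a "
        "summary of the whole video. Answer with a JSON list of clip indices."
    )

    # 3) Ask the LLM and parse its answer.
    response = llm_fn(prompt)
    indices = json.loads(response)
    return sorted(int(i) for i in indices)[:k]
```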