Vision-Language Models (VLMs) can process increasingly long videos. Yet important visual information is easily lost across such long contexts and missed by VLMs. It is also important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These descriptions are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
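The pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `caption_clip` and `query_llm` are hypothetical stand-ins for a lightweight video captioning model and an LLM call, and the prompt wording is assumed.

```python
from typing import Callable, List


def split_into_clips(num_frames: int, clip_len: int) -> List[range]:
    """Partition frame indices into consecutive short clips."""
    return [range(start, min(start + clip_len, num_frames))
            for start in range(0, num_frames, clip_len)]


def select_key_clips(
    clips: List[range],
    caption_clip: Callable[[range], str],   # lightweight captioner (assumed interface)
    query_llm: Callable[[str], List[int]],  # LLM returning clip indices (assumed interface)
    k: int,
) -> List[range]:
    """Caption every clip, then ask the LLM for the K most summary-relevant ones."""
    captions = [caption_clip(clip) for clip in clips]
    prompt = (
        f"Select the {k} clips whose content is most relevant for a plot summary.\n"
        + "\n".join(f"[{i}] {cap}" for i, cap in enumerate(captions))
    )
    chosen = query_llm(prompt)[:k]
    return [clips[i] for i in chosen]
```

The selected clips would then be combined with the textual input to form the multimodal summary; the cost advantage comes from captioning with a small model and sending only compact text descriptions to the LLM.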