Vision-Language Models (VLMs) can process increasingly long videos. Yet important visual information is easily diluted across the long context and missed by VLMs. Moreover, tools that enable cost-effective analysis of lengthy video content are needed. In this paper, we propose a clip selection method that identifies the key video moments to include in a multimodal summary. We divide the video into short clips and generate a compact visual description of each using a lightweight video captioning model. These descriptions are then passed to a large language model (LLM), which selects the K clips containing the visual information most relevant to a multimodal summary. We evaluate our approach against reference clips for the task, derived automatically from the full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of a movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve summarization performance close to that of the reference clips while capturing substantially more relevant video information than random clip selection. Importantly, computational cost remains low because we rely on a lightweight captioning model.
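The pipeline described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration of the caption-then-select loop, not the authors' implementation: `caption_clip` (standing in for the lightweight video captioning model), `query_llm` (the LLM call that ranks clips), the 20-second clip length, and the prompt wording are all assumptions made for the example.

```python
# Minimal sketch of the clip-selection pipeline (hypothetical names throughout).
# caption_clip and query_llm are placeholders for a lightweight video captioning
# model and an LLM API call; neither reflects the paper's released code.
import re
from dataclasses import dataclass


@dataclass
class Clip:
    index: int
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds


def split_into_clips(video_duration_s: float, clip_len_s: float = 20.0) -> list[Clip]:
    """Divide the video into consecutive fixed-length short clips."""
    clips, t, i = [], 0.0, 0
    while t < video_duration_s:
        clips.append(Clip(i, t, min(t + clip_len_s, video_duration_s)))
        t += clip_len_s
        i += 1
    return clips


def caption_clip(clip: Clip) -> str:
    """Placeholder for the lightweight video captioning model."""
    return f"compact visual description of clip {clip.index}"


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; assumed to return a comma-separated index list."""
    return "0, 3, 7"


def select_key_clips(clips: list[Clip], k: int) -> list[Clip]:
    """Caption every clip, then ask the LLM for the K most summary-relevant ones."""
    captions = [caption_clip(c) for c in clips]
    listing = "\n".join(f"[{c.index}] {cap}" for c, cap in zip(clips, captions))
    prompt = (
        "Below are captions of consecutive movie clips.\n"
        f"{listing}\n"
        f"Return the indices of the {k} clips whose visual content is most "
        "important for a plot summary, as a comma-separated list."
    )
    chosen = {int(m) for m in re.findall(r"\d+", query_llm(prompt))}
    return [c for c in clips if c.index in chosen][:k]


if __name__ == "__main__":
    clips = split_into_clips(video_duration_s=600.0, clip_len_s=20.0)
    for c in select_key_clips(clips, k=3):
        print(f"clip {c.index}: {c.start_s:.0f}s-{c.end_s:.0f}s")
```

The selected clips (or their captions) would then be combined with the textual input to build the multimodal summary; only the captioning step touches raw video, which is what keeps the computational cost low.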

