Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Various video summarization datasets exist, but they contain only a limited number of source videos, which hampers effective fine-tuning of advanced large vision-language models (VLMs). In addition, most existing datasets are built for video-to-video summarization and overlook the contemporary need for multimodal video content summarization. Recent efforts have expanded from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and combined video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, instantiated as V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks in the text decoder of a single large language model (LLM) and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for the V2V and V2VT summarization tasks.
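To make the idea of a textual summary that references frame indexes more concrete, the following is a minimal sketch of how one annotation could be split into a V2V keyframe list and a V2T text, and how a summarization ratio could be computed. The bracketed `[3, 4, 5]` reference format, the function names, and the definition of the ratio as selected frames divided by total frames are all assumptions made for illustration; they are not taken from the Instruct-V2Xum release.

```python
import re
from typing import List

# Hypothetical annotation format (an assumption for this sketch): each
# sentence may end with bracketed, comma-separated frame indexes, e.g.
# "A man opens a toolbox [3, 4, 5]."
FRAME_REF = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def extract_keyframes(text_summary: str) -> List[int]:
    """Return the sorted, unique frame indexes referenced in the text (V2V view)."""
    frames = set()
    for match in FRAME_REF.finditer(text_summary):
        frames.update(int(token) for token in match.group(1).split(","))
    return sorted(frames)


def strip_frame_refs(text_summary: str) -> str:
    """Return the text with frame references removed (V2T view)."""
    text = FRAME_REF.sub("", text_summary)
    # Remove the space left behind before punctuation.
    return re.sub(r"\s+([.,;])", r"\1", text).strip()


def summarization_ratio(keyframes: List[int], total_frames: int) -> float:
    """Fraction of sampled frames kept in the summary (assumed definition)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return len(keyframes) / total_frames


if __name__ == "__main__":
    annotation = "A man opens a toolbox [3, 4, 5]. He fixes the bike wheel [10, 11]."
    keyframes = extract_keyframes(annotation)
    print(keyframes)
    print(strip_frame_refs(annotation))
    print(round(summarization_ratio(keyframes, total_frames=30), 4))
```

Under these assumptions, a single interleaved annotation serves all three sub-tasks: the keyframe list alone for V2V, the stripped text alone for V2T, and both together for V2VT.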