V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Various video summarization datasets exist, but they contain too few source videos to support effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are designed for video-to-video summarization and overlook the contemporary need for multimodal video content summarization. Recent efforts have expanded from unimodal to multimodal video summarization, dividing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and combined video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, instantiated as V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into a single large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
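
To illustrate how a textual summary that references frame indexes can yield an aligned video summary, the following is a minimal sketch and not the authors' implementation. The bracketed marker format (for example `[12]`), the function names, and the definition of the summarization ratio as selected frames divided by total frames are all assumptions made for illustration; the abstract does not specify them.

```python
import re

# ASSUMPTION: frame references appear as bracketed integers such as "[12]".
# The abstract does not state the actual marker format used in Instruct-V2Xum.
FRAME_REF = re.compile(r"\[(\d+)\]")


def extract_keyframes(text_summary: str) -> list[int]:
    """Return the sorted, de-duplicated frame indexes referenced in the text."""
    return sorted({int(idx) for idx in FRAME_REF.findall(text_summary)})


def strip_frame_refs(text_summary: str) -> str:
    """Return the plain text summary with the frame markers removed."""
    no_refs = FRAME_REF.sub("", text_summary)
    collapsed = re.sub(r"\s+", " ", no_refs)
    # Remove the space left before punctuation once a marker is deleted.
    return collapsed.replace(" .", ".").replace(" ,", ",").strip()


def summarization_ratio(keyframes: list[int], total_frames: int) -> float:
    """ASSUMPTION: ratio = number of selected frames / total sampled frames."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return len(keyframes) / total_frames


# Hypothetical interleaved summary for a video sampled at 20 frames.
example = "A chef chops onions [3] [4], then fries them [10]."
keyframes = extract_keyframes(example)
plain_text = strip_frame_refs(example)
ratio = summarization_ratio(keyframes, total_frames=20)
```

Under these assumptions, a single decoded string serves all three sub-tasks: the extracted indexes form the V2V summary, the stripped text forms the V2T summary, and the pair together forms the V2VT summary.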
