The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that operate without training data. Existing methods either rely on domain-specific datasets, which limits generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without any training data, beating unsupervised methods and matching supervised ones. Our pipeline (i) segments the video into scenes, (ii) produces scene descriptions with a memory-efficient batch-prompting scheme that scales to hours-long videos on a single GPU, (iii) scores scene importance with an LLM via tailored prompts, and (iv) propagates scores to frames using new consistency (temporal coherence) and uniqueness (novelty) metrics for fine-grained frame importance. On SumMe and TVSum, our approach surpasses all prior data-hungry unsupervised methods and performs competitively on the Query-Focused Video Summarization benchmark, where competing methods require supervised frame-level importance labels. We release VidSum-Reason, a query-driven dataset featuring long-tailed concepts and multi-step reasoning, on which our framework serves as the first challenging baseline. Overall, we demonstrate that pretrained multi-modal models, when orchestrated with principled prompting and score propagation, provide a powerful foundation for universal, text-queryable video summarization.
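The scene-to-frame score propagation in step (iv) can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual method: the function name, the cosine-similarity proxies for consistency and uniqueness, and the `alpha` mixing weight are all assumptions introduced here for clarity.

```python
import numpy as np

def propagate_scores(scene_scores, scene_bounds, frame_feats, alpha=0.5):
    """Spread per-scene LLM importance scores to individual frames.

    Each frame is weighted by a consistency term (cosine similarity to
    its own scene's mean feature, a temporal-coherence proxy) and a
    uniqueness term (dissimilarity to the other scenes' mean features,
    a novelty proxy). Hypothetical sketch, not the paper's formulas.
    """
    # L2-normalize frame features so dot products are cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    scene_means = np.stack([feats[s:e].mean(axis=0) for s, e in scene_bounds])
    scene_means /= np.linalg.norm(scene_means, axis=1, keepdims=True)

    frame_scores = np.zeros(len(feats))
    for k, (s, e) in enumerate(scene_bounds):
        # Consistency: how well each frame agrees with its scene's centroid.
        consistency = feats[s:e] @ scene_means[k]
        # Uniqueness: one minus the closest match to any *other* scene.
        others = np.delete(scene_means, k, axis=0)
        uniqueness = 1.0 - (feats[s:e] @ others.T).max(axis=1)
        weight = alpha * consistency + (1.0 - alpha) * uniqueness
        frame_scores[s:e] = scene_scores[k] * weight
    return frame_scores
```

A frame that is both representative of a highly scored scene and visually distinct from the rest of the video thus receives a high fine-grained score, which a downstream knapsack-style selector could use to assemble the final skim.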