The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that operate without training data. Existing methods either rely on domain-specific datasets, which limits generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without any training data, beating unsupervised methods and matching supervised ones. Our pipeline (i) segments the video into scenes, (ii) produces scene descriptions with a memory-efficient batch-prompting scheme that scales to hours-long videos on a single GPU, (iii) scores scene importance with an LLM via tailored prompts, and (iv) propagates scores to frames using new consistency (temporal coherence) and uniqueness (novelty) metrics for fine-grained frame importance. On SumMe and TVSum, our approach surpasses all prior data-hungry unsupervised methods and performs competitively on the Query-Focused Video Summarization benchmark, where competing methods require supervised frame-level importance labels. We release VidSum-Reason, a query-driven dataset featuring long-tailed concepts and multi-step reasoning, on which our framework serves as the first challenging baseline. Overall, we demonstrate that pretrained multi-modal models, when orchestrated with principled prompting and score propagation, provide a powerful foundation for universal, text-queryable video summarization.
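The scene-to-frame score propagation in step (iv) can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual method: the function name, the cosine-similarity proxies for consistency and uniqueness, and the `alpha` mixing weight are all assumptions introduced here for clarity.

```python
import numpy as np

def propagate_scores(scene_scores, scene_bounds, frame_feats, alpha=0.5):
    """Spread per-scene LLM importance scores to individual frames.

    Each frame is weighted by a consistency term (cosine similarity to
    its own scene's mean feature, a temporal-coherence proxy) and a
    uniqueness term (dissimilarity to the other scenes' mean features,
    a novelty proxy). Hypothetical sketch, not the paper's formulas.
    """
    # L2-normalize frame features so dot products are cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    scene_means = np.stack([feats[s:e].mean(axis=0) for s, e in scene_bounds])
    scene_means /= np.linalg.norm(scene_means, axis=1, keepdims=True)

    frame_scores = np.zeros(len(feats))
    for k, (s, e) in enumerate(scene_bounds):
        # Consistency: how well each frame agrees with its scene's centroid.
        consistency = feats[s:e] @ scene_means[k]
        # Uniqueness: one minus the closest match to any *other* scene.
        others = np.delete(scene_means, k, axis=0)
        uniqueness = 1.0 - (feats[s:e] @ others.T).max(axis=1)
        weight = alpha * consistency + (1.0 - alpha) * uniqueness
        frame_scores[s:e] = scene_scores[k] * weight
    return frame_scores
```

A frame that is both representative of a highly scored scene and visually distinct from the rest of the video thus receives a high fine-grained score, which a downstream knapsack-style selector could use to assemble the final skim.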