With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital. Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues. More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet they remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are then aggregated into structured, dataset-adaptive scoring rubrics that guide interpretable scene evaluation. During inference, the first and last segments are scored based solely on their own descriptions, whereas intermediate segments also incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy. This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. On SumMe and TVSum, our method achieves F1 scores of 57.58 and 63.05, respectively, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance. The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.
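The inference-time prompting rule described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual prompt templates or rubric format, so the function name `build_scene_prompt`, the prompt wording, and the use of plain strings for the rubric and summaries are all hypothetical; only the boundary-versus-intermediate distinction comes from the text.

```python
from typing import List, Optional


def build_scene_prompt(
    rubric: str,
    descriptions: List[str],
    index: int,
    summaries: Optional[List[str]] = None,
) -> str:
    """Build a scoring prompt for the scene at `index` (hypothetical helper).

    First and last scenes use only their own description; intermediate
    scenes additionally receive brief summaries of the adjacent scenes.
    """
    if not 0 <= index < len(descriptions):
        raise IndexError("scene index out of range")

    # Assumption: if no separate summaries are supplied, fall back to the
    # raw descriptions as the neighbouring context.
    if summaries is None:
        summaries = descriptions

    parts = [
        f"Scoring rubric:\n{rubric}",
        f"Scene description:\n{descriptions[index]}",
    ]

    is_boundary = index == 0 or index == len(descriptions) - 1
    if not is_boundary:
        # Context lets the model judge narrative progression and redundancy.
        parts.append(f"Previous scene summary:\n{summaries[index - 1]}")
        parts.append(f"Next scene summary:\n{summaries[index + 1]}")

    parts.append("Return a single importance score according to the rubric.")
    return "\n\n".join(parts)
```

Calling `build_scene_prompt(rubric, descriptions, i)` for each scene index `i` would yield one prompt per scene; how the resulting scores are normalized and turned into a summary is not specified in the abstract and is left out here.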