News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions for indexing and retrieval remains a largely manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and the Entity Fidelity Score (EFS). Our analysis reveals that standard metrics have limited discriminative power for news video captioning because of surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing how well the generated captions preserve thematic structure and cover named entities. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
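Of the metrics listed above, Mean Reciprocal Rank has a compact standard definition: the average, over all queries, of one divided by the rank at which the correct item is retrieved. The sketch below illustrates that definition only. It assumes a caption-to-reference retrieval setup in which row `i` of a similarity matrix holds the scores of generated caption `i` against every reference and the correct match sits at index `i`. That setup, the function names, and the toy scores are illustrative assumptions, not the evaluation pipeline used in this work.

```python
from typing import Sequence


def rank_of_match(scores: Sequence[float], target_index: int) -> int:
    """Return the 1-based rank of scores[target_index] (higher score = better).

    Ties are resolved optimistically: only strictly larger scores push the
    target down the ranking.
    """
    target = scores[target_index]
    return 1 + sum(1 for s in scores if s > target)


def mean_reciprocal_rank(similarity: Sequence[Sequence[float]]) -> float:
    """Standard MRR, assuming the correct reference for row i is column i.

    This diagonal-match convention is an assumption made for illustration.
    """
    if not similarity:
        raise ValueError("similarity must contain at least one row")
    reciprocal_ranks = [
        1.0 / rank_of_match(row, i) for i, row in enumerate(similarity)
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Toy scores (hypothetical): captions 0 and 2 retrieve their own reference
# first, while caption 1 ranks its reference third.
toy_similarity = [
    [0.9, 0.2, 0.1],
    [0.4, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
mrr = mean_reciprocal_rank(toy_similarity)  # (1 + 1/3 + 1) / 3
```

Because only the rank of the correct item enters the score, MRR rewards captions that are distinctive enough to single out their own clip, which is a different property from word overlap with a reference description.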