Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. The resulting cohesive multimodal summary anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics such as BLEU or ROUGE cannot quantify information coverage across disparate modalities, e.g., when comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that uses vision-language model (VLM) inference to quantify the video information a summary fails to capture. Because it measures information loss rather than surface overlap, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores correlate significantly with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection that optimizes the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.
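To make the core idea concrete, the sketch below shows one way a ViSIL-style information-loss score could be operationalized via VLM inference: probe questions about the video are answered by a VLM once from the full footage and once from the summary, and the score is the fraction of video-recoverable facts the summary loses. This is an illustrative assumption, not the paper's exact formulation; the function names (`visil_style_loss`, `answer_from_video`, `answer_from_summary`) and the probe-question proxy are hypothetical.

```python
from typing import Callable, List

def visil_style_loss(
    probes: List[str],                           # probe questions about the video
    reference_answers: List[str],                # ground-truth answers to the probes
    answer_from_video: Callable[[str], str],     # hypothetical: VLM answering from full video
    answer_from_summary: Callable[[str], str],   # hypothetical: VLM answering from the summary
) -> float:
    """Fraction of probes answerable from the full video but not from the summary.

    A rough proxy for 'video information not captured by the summary';
    the actual ViSIL score is information-theoretic and may be defined differently.
    """
    def correct(pred: str, ref: str) -> bool:
        # Simple exact-match check; real evaluations would use a softer comparison.
        return pred.strip().lower() == ref.strip().lower()

    recoverable = 0  # facts the VLM can recover from the full video
    lost = 0         # of those, facts it cannot recover from the summary
    for question, ref in zip(probes, reference_answers):
        if correct(answer_from_video(question), ref):
            recoverable += 1
            if not correct(answer_from_summary(question), ref):
                lost += 1
    return lost / recoverable if recoverable else 0.0
```

Because this proxy is computed identically whether the summary is text, keyframes, or a mix, it illustrates how an information-loss view permits direct comparison across summary formats with different structures.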