We present a novel human-annotated dataset, termed DeVAn (Dense Video Annotation), for evaluating the ability of visual-language models to generate both short and long descriptions of real-world video clips. The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given any video selected from the dataset and its corresponding ASR information, we evaluate visual-language models on caption or summary generation grounded in both the visual and auditory content of the video. Additionally, models are evaluated on caption- and summary-based retrieval tasks, where the summary-based retrieval task requires identifying a target video given excerpts of its summary. Given the novel nature of the paragraph-length video summarization task, we compared existing evaluation metrics on their alignment with human preferences and found that model-based evaluation metrics provide more semantically oriented and human-aligned evaluation. Finally, we benchmarked a wide range of current video-language models on DeVAn, and we aim for DeVAn to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks. Code is available at https://github.com/TK-21st/DeVAn.
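To make the metric comparison concrete, the following is a minimal sketch of scoring one generated summary against multiple human references with both an n-gram metric (BLEU) and a model-based metric (BERTScore). The `nltk` and `bert-score` packages and all example strings are illustrative assumptions, not the paper's actual evaluation pipeline; the sketch only shows why embedding-based scoring tolerates paraphrase where surface n-gram overlap does not.

```python
# Minimal sketch (assumes `pip install nltk bert-score`); strings are
# illustrative only, not drawn from the DeVAn dataset.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

references = [
    "A chef demonstrates how to fold dumplings, then steams and serves them.",
    "The video shows a cook preparing dumplings step by step before steaming.",
]
candidate = "A cook shows the steps of folding dumplings and steaming them."

# N-gram overlap: tokenize and compare surface forms (smoothing avoids
# zero scores when higher-order n-grams have no overlap).
bleu = sentence_bleu(
    [r.split() for r in references],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# Model-based: contextual-embedding similarity; bert_score supports
# multiple references per candidate and keeps the best match.
P, R, F1 = bert_score([candidate], [references], lang="en")

print(f"BLEU:      {bleu:.3f}")
print(f"BERTScore: {F1.item():.3f}")
```

On paraphrased but semantically faithful summaries like the one above, the n-gram score stays low while the model-based score stays high, which is the behavior the abstract describes as more semantically oriented and human-aligned.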