We present a novel human-annotated dataset, termed DeVAn (Dense Video Annotation), for evaluating the ability of visual-language models to generate both short and long descriptions of real-world video clips. The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given any video selected from the dataset and its corresponding ASR information, we evaluate visual-language models on caption or summary generation grounded in both the visual and auditory content of the video. Additionally, models are evaluated on caption- and summary-based retrieval tasks, where the summary-based retrieval task requires identifying a target video given excerpts of its summary. Given the novel nature of the paragraph-length video summarization task, we compared existing evaluation metrics on their alignment with human preferences and found that model-based evaluation metrics provide more semantically oriented and human-aligned evaluation. Finally, we benchmarked a wide range of current video-language models on DeVAn, and we aim for DeVAn to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks. Code is available at https://github.com/TK-21st/DeVAn.
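To make the metric comparison concrete, the following is a minimal sketch of scoring one generated summary against multiple human references with both an n-gram metric (BLEU) and a model-based metric (BERTScore). The `nltk` and `bert-score` packages and all example strings are illustrative assumptions, not the paper's actual evaluation pipeline; the sketch only shows why embedding-based scoring tolerates paraphrase where surface n-gram overlap does not.

```python
# Minimal sketch (assumes `pip install nltk bert-score`); strings are
# illustrative only, not drawn from the DeVAn dataset.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

references = [
    "A chef demonstrates how to fold dumplings, then steams and serves them.",
    "The video shows a cook preparing dumplings step by step before steaming.",
]
candidate = "A cook shows the steps of folding dumplings and steaming them."

# N-gram overlap: tokenize and compare surface forms (smoothing avoids
# zero scores when higher-order n-grams have no overlap).
bleu = sentence_bleu(
    [r.split() for r in references],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# Model-based: contextual-embedding similarity; bert_score supports
# multiple references per candidate and keeps the best match.
P, R, F1 = bert_score([candidate], [references], lang="en")

print(f"BLEU:      {bleu:.3f}")
print(f"BERTScore: {F1.item():.3f}")
```

On paraphrased but semantically faithful summaries like the one above, the n-gram score stays low while the model-based score stays high, which is the behavior the abstract describes as more semantically oriented and human-aligned.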