ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt, Aykut Erdem, Erkut Erdem

arXiv preprint. 48 pages, 22 figures, 10 tables

With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
