With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.