Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to elucidate their success and failure modes and provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA: they can correlate contextual cues and generate plausible responses to questions about varied video content. However, the models falter in handling video temporality, both in reasoning about temporal content ordering and in grounding QA-relevant temporal moments. Moreover, the models behave unintuitively: they are unresponsive to adversarial video perturbations yet sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. These findings demonstrate Video-LLMs' QA capability under standard conditions yet highlight their severe deficiencies in robustness and interpretability, suggesting an urgent need for rationales in Video-LLM development.