Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and largely neglect two critical aspects: reasoning over complex real-world videos and robustness to variations in user prompts posed as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, spanning both open-source and closed-source variants, and find that most Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/.