TempCompass: Do Video LLMs Really Understand Videos?

Recently, there has been a surge of interest in video large language models (Video LLMs). However, existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them do not distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different types of tasks. Motivated by these two problems, we propose the TempCompass benchmark, which introduces a diversity of temporal aspects and task formats. To collect high-quality test data, we devise two novel strategies: (1) In video collection, we construct conflicting videos that share the same static content but differ in a specific temporal aspect, which prevents Video LLMs from leveraging single-frame bias or language priors. (2) To collect the task instructions, we propose a paradigm where humans first annotate meta-information for a video and then an LLM generates the instruction. We also design an LLM-based approach to automatically and accurately evaluate the responses from Video LLMs. Based on TempCompass, we comprehensively evaluate 8 state-of-the-art (SOTA) Video LLMs and 3 Image LLMs, and find that these models exhibit notably poor temporal perception ability. Our data will be available at https://github.com/llyx97/TempCompass.
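
To make the idea of conflicting videos concrete, the following is a minimal sketch, not the construction procedure used for the benchmark. It assumes a clip is simply an ordered sequence of frames, and it assumes two illustrative edits (frame reversal for direction, frame subsampling for speed); the function names `reverse_direction` and `change_speed` are hypothetical. Each edit keeps the static content of the frames and changes only one temporal aspect.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def reverse_direction(frames: Sequence[Frame]) -> List[Frame]:
    """Play the clip backwards: same frames, opposite temporal direction.

    Assumption: reversal is one plausible way to vary the "direction" aspect.
    """
    return list(reversed(frames))


def change_speed(frames: Sequence[Frame], factor: int) -> List[Frame]:
    """Speed the clip up by keeping every `factor`-th frame.

    Assumption: subsampling is one plausible way to vary the "speed" aspect.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return list(frames[::factor])


# Stand-in frames: integers instead of decoded images.
clip = [0, 1, 2, 3, 4, 5]
reversed_clip = reverse_direction(clip)  # same frames, reversed order
fast_clip = change_speed(clip, 2)        # every second frame
```

Because the original and edited clips contain the same frames, a model that inspects a single frame or relies on language priors has no basis for telling them apart; only the frame order or sampling rate differs.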
