The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, MT-Video-Bench assesses six core competencies centered on perceptivity and interactivity, encompassing 1,000 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing significant performance discrepancies among them and their limitations in handling multi-turn video dialogues. The benchmark will be publicly released to foster future research.