AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. With the instance-specific rules as its prompt, GPT-4, acting as an automatic evaluator, achieves a stable evaluation accuracy of around 97.0%, comparable to the 94.9% to 97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eight large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms the other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to the human accuracy of 72.8%. Through an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.
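The rule-based evaluation described above can be illustrated with a minimal sketch. It is not the benchmark's actual implementation: the `Instance` type, the prompt wording, the one-word verdict format, and the `judge` callable (a stand-in for whatever LLM client is used, such as one wrapping GPT-4) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """One video-question pair with its instance-specific rules (hypothetical type)."""
    question: str
    rules: str


def build_prompt(instance: Instance, response: str) -> str:
    # Hypothetical template; the benchmark's real prompt wording may differ.
    return (
        "You are grading an answer to a question about a video.\n"
        f"Question: {instance.question}\n"
        f"Evaluation rules: {instance.rules}\n"
        f"Answer to grade: {response}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(judge_output: str) -> bool:
    # Anything other than a clear "correct" is counted as incorrect.
    return judge_output.strip().lower().rstrip(".") == "correct"


def accuracy(
    instances: List[Instance],
    responses: List[str],
    judge: Callable[[str], str],
) -> float:
    """Fraction of responses the judge marks correct under each instance's rules."""
    if len(instances) != len(responses):
        raise ValueError("instances and responses must have the same length")
    if not instances:
        return 0.0
    verdicts = [
        parse_verdict(judge(build_prompt(inst, resp)))
        for inst, resp in zip(instances, responses)
    ]
    return sum(verdicts) / len(verdicts)
```

Passing the judge as a plain callable keeps the sketch independent of any particular LLM API, so a stub function can replace the real evaluator when testing the scoring logic.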
