We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. With the instance-specific rules as the prompt, GPT-4, acting as an automatic evaluator, achieves a stable evaluation accuracy of around 97.0%, comparable to the 94.9% to 97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eight large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms the other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to the human accuracy of 72.8%. Through an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.
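The rule-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Instance` schema, the prompt wording, the one-word verdict format, and the `judge` callable (a stand-in for a call to an LLM such as GPT-4) are all assumptions made for illustration. The official prompts and rules are in the repository linked above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Instance:
    """One video-question pair with its own rules (hypothetical schema)."""
    question: str
    rules: str  # instance-specific evaluation rules written by annotators


def build_eval_prompt(instance: Instance, response: str) -> str:
    """Assemble an evaluator prompt; the wording here is an assumption."""
    return (
        "You are grading an answer to a question about a video.\n"
        f"Question: {instance.question}\n"
        f"Evaluation rules: {instance.rules}\n"
        f"Answer to grade: {response}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(judge_output: str) -> bool:
    """Treat the output as correct only if its first word is 'correct'."""
    words = judge_output.strip().lower().split()
    return bool(words) and words[0].strip(".,:;!") == "correct"


def accuracy(
    instances: Sequence[Instance],
    responses: Sequence[str],
    judge: Callable[[str], str],  # hypothetical: maps a prompt to the evaluator's text
) -> float:
    """Fraction of responses that the judge marks as correct."""
    if len(instances) != len(responses):
        raise ValueError("instances and responses must have the same length")
    if not instances:
        raise ValueError("need at least one instance")
    verdicts = [
        parse_verdict(judge(build_eval_prompt(inst, resp)))
        for inst, resp in zip(instances, responses)
    ]
    return sum(verdicts) / len(verdicts)
```

Passing the evaluator as a plain callable keeps the sketch independent of any particular LLM client, so a stub function can be substituted when checking the scoring logic offline.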