Despite the rapid development of video Large Language Models (LLMs), a comprehensive evaluation is still absent. In this paper, we introduce a unified evaluation that encompasses multiple video tasks, including captioning, question answering, retrieval, and action recognition. In addition to conventional metrics, we show that GPT-based evaluation can match human-level performance in assessing response quality across multiple aspects. We propose a simple baseline, Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs. Finally, we evaluate video LLMs beyond academic datasets; they show encouraging recognition and reasoning capabilities in driving scenarios after fine-tuning on only hundreds of video-instruction pairs. We hope our work can serve as a unified evaluation for video LLMs and help extend them to more practical scenarios. The evaluation code will be available soon.