Large vision-language models (LVLMs) have recently advanced rapidly, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting a visual receptor to large language models (LLMs). However, current assessments mainly focus on recognition and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. First, we construct a comprehensive visual dialogue dataset, TouchStone, consisting of open-world images and questions and covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Second, by integrating detailed image annotations, we effectively transform the multimodal input content into a form that LLMs can understand. This enables us to employ advanced LLMs to directly evaluate the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality using their textual capabilities alone, in alignment with human preferences. We hope our work can serve as a touchstone for the evaluation of LVLMs and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.
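To make the judging setup concrete, the following is a minimal sketch of how a text-only judge prompt might be assembled from an image annotation, a question, and a model answer, and how a numeric score might be read back from the judge's reply. It is not taken from the TouchStone repository: the names `DialogueSample`, `build_judge_prompt`, and `parse_score`, the prompt wording, and the 1 to 10 scale are all assumptions made for illustration, and the call to the judge LLM itself is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DialogueSample:
    """One evaluation item (hypothetical structure, not the TouchStone schema)."""
    annotation: str  # detailed text description standing in for the image
    question: str    # question asked about the image
    answer: str      # the LVLM's reply that should be scored


def build_judge_prompt(sample: DialogueSample) -> str:
    """Render a text-only prompt; the judge never sees the image itself."""
    return (
        "You are grading an assistant's answer to a question about an image.\n"
        "You cannot see the image; rely on the annotation below.\n\n"
        f"[Image annotation]\n{sample.annotation}\n\n"
        f"[Question]\n{sample.question}\n\n"
        f"[Answer]\n{sample.answer}\n\n"
        "Reply with a line of the form 'Score: <integer from 1 to 10>'."
    )


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the integer score; return None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


sample = DialogueSample(
    annotation="A brown dog runs across a beach carrying a red frisbee.",
    question="What is the dog holding?",
    answer="The dog is carrying a red frisbee in its mouth.",
)
prompt = build_judge_prompt(sample)
```

The resulting `prompt` string would be sent to whichever judge LLM is used, and its reply passed to `parse_score`. Returning `None` for malformed or out-of-range replies keeps unparseable judgments from being silently counted as scores.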