Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically designed for emerging Chinese VLMs. The benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories and covering both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench comprises 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that surpasses GPT-4 in evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation code and data are available at https://alignmmbench.github.io.