Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily assess basic abilities through non-verbal formats such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark designed specifically for emerging Chinese VLMs. The benchmark is curated from real-world scenarios and Chinese Internet sources, covers thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. With a prompt rewrite strategy applied, AlignMMBench comprises 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation code and data are available at https://alignmmbench.github.io.