With advances in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in intricate cognition and reasoning tasks. However, existing evaluation benchmarks, which rely primarily on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing how well these increasingly anthropomorphic models align with human intelligence. In this work, we address these limitations with Auto-Bench, which explores LLMs as proficient aligners that measure the alignment between VLMs and human intelligence and values through automatic data curation and assessment. Specifically, for data curation, Auto-Bench uses LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets by prompting on visual symbolic representations (e.g., captions, object locations, and instance relationships). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We then employ LLMs such as GPT-3.5 as judges, performing quantitative and qualitative automated assessments for a comprehensive evaluation of VLMs. Our validation results show that LLMs are proficient at both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating increasingly sophisticated VLMs.
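As a rough illustration of the two stages described in the abstract, the sketch below assembles a text prompt from visual symbolic representations and computes an agreement rate between two sets of verdicts. It is a minimal sketch under stated assumptions, not Auto-Bench's implementation: the `SymbolicImage` fields, the prompt wording, and the definition of agreement as the fraction of matching LLM and human verdicts are all hypothetical, and no LLM is actually called.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolicImage:
    """Text-only stand-in for an image. All field names are hypothetical."""
    captions: list
    # Object name -> bounding box (x1, y1, x2, y2); the box format is an assumption.
    objects: dict = field(default_factory=dict)
    # (subject, predicate, object) triples describing instance relationships.
    relations: list = field(default_factory=list)


def build_curation_prompt(image: SymbolicImage, ability: str) -> str:
    """Turn symbolic annotations into a prompt asking for one QA-reasoning triplet.

    The wording is illustrative; the prompts used by Auto-Bench are not given
    in the abstract.
    """
    lines = [f"Target ability: {ability}", "Captions:"]
    for caption in image.captions:
        lines.append(f"- {caption}")
    lines.append("Objects (name: x1, y1, x2, y2):")
    for name, box in image.objects.items():
        coords = ", ".join(f"{value:.2f}" for value in box)
        lines.append(f"- {name}: {coords}")
    lines.append("Relations:")
    for subject, predicate, obj in image.relations:
        lines.append(f"- {subject} {predicate} {obj}")
    lines.append("Write one question, its answer, and the reasoning behind the answer.")
    return "\n".join(lines)


def agreement_rate(llm_verdicts: list, human_verdicts: list) -> float:
    """Fraction of items on which the LLM verdict equals the human verdict.

    Assumes agreement is measured item by item against human verdicts.
    """
    if len(llm_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not llm_verdicts:
        raise ValueError("verdict lists must not be empty")
    matches = sum(a == b for a, b in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)


if __name__ == "__main__":
    image = SymbolicImage(
        captions=["A dog sleeping on a sofa."],
        objects={"dog": (0.10, 0.40, 0.55, 0.90), "sofa": (0.00, 0.30, 1.00, 1.00)},
        relations=[("dog", "on", "sofa")],
    )
    print(build_curation_prompt(image, ability="spatial reasoning"))
    print(agreement_rate([True, False, True, True], [True, True, True, True]))
```

In a real pipeline the prompt string would be sent to an LLM, and the returned triplet would be parsed and, for the verified subset, checked by human annotators; those steps are omitted here because the abstract does not specify them.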