Large Language Models as Automated Aligners for Benchmarking Vision-Language Models

With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in intricate cognition and reasoning tasks. However, existing evaluation benchmarks, which rely primarily on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing how well these increasingly anthropomorphic models align with human intelligence. In this work, we address these limitations with Auto-Bench, which explores LLMs as proficient aligners that measure the alignment between VLMs and human intelligence and values through automatic data curation and assessment. Specifically, for data curation, Auto-Bench uses LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets by prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, etc.). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs such as GPT-3.5 to serve as judges, implementing quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs. Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating increasingly sophisticated VLMs.
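The two steps described in the abstract (prompting an LLM on symbolic image annotations, then checking LLM verdicts against human ones) can be illustrated with a minimal Python sketch. This is not Auto-Bench's actual code: the `ImageAnnotations` fields, the prompt wording, and the helper names are hypothetical, and the agreement rate is assumed here to be the plain fraction of items on which two verdict lists match, since the abstract does not define it. No LLM is called; the sketch only builds the text that would be sent.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnnotations:
    """Text-only description of one image. Field names are hypothetical."""
    captions: list[str]
    objects: list[tuple[str, tuple[int, int, int, int]]]  # (label, bounding box)
    relationships: list[str] = field(default_factory=list)


def build_curation_prompt(ann: ImageAnnotations, ability: str) -> str:
    """Render the annotations as text and request one question-answer-reasoning triplet."""
    lines = [f"Target ability: {ability}", "Captions:"]
    lines += [f"- {c}" for c in ann.captions]
    lines.append("Objects (label, box):")
    lines += [f"- {label} {box}" for label, box in ann.objects]
    if ann.relationships:
        lines.append("Relationships:")
        lines += [f"- {r}" for r in ann.relationships]
    lines.append("Write one question, its answer, and the reasoning behind the answer.")
    return "\n".join(lines)


def agreement_rate(llm_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of items where both verdicts match (an assumed definition)."""
    if len(llm_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not llm_verdicts:
        raise ValueError("verdict lists must not be empty")
    matches = sum(l == h for l, h in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)


# Made-up example data, for illustration only.
ann = ImageAnnotations(
    captions=["A dog chases a ball in a park."],
    objects=[("dog", (34, 50, 120, 160)), ("ball", (200, 140, 230, 170))],
    relationships=["dog chasing ball"],
)
prompt = build_curation_prompt(ann, "reasoning")
rate = agreement_rate([True, False, True, True], [True, True, True, True])
```

Keeping the annotations purely textual is what lets a text-only LLM stand in for a model that sees the image, which is the idea behind prompting on "visual symbolic representations".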
