Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

from arxiv, 22 pages, 11 figures. Training code and models: https://github.com/TRI-ML/prismatic-vlms. Evaluation code: https://github.com/TRI-ML/vlm-evaluation

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance $-$ a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization from language, and targeted challenge sets that probe properties such as hallucination; evaluations that provide calibrated, fine-grained insight into a VLM's capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and quantifying the tradeoffs of using base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible code for VLM training, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open-source VLMs.

翻译：视觉条件语言模型（VLM）在视觉对话、场景理解与机器人任务规划等应用中的采用日益广泛，这一趋势催生了LLaVa、InstructBLIP和PaLI-3等众多新型模型。尽管新模型层出不穷，但关于图像预处理、架构设计和优化策略的关键设计决策仍未被充分探索，导致难以理解模型性能的关键影响因素——这一困境因缺乏客观、一致的评估标准而进一步加剧。为填补这些空白，我们首先构建了一套标准化评估体系，涵盖视觉问答、基于语言的目标定位以及专门探测幻觉等特性的挑战性数据集；这些评估能提供经过标定且细粒度的VLM能力洞察。其次，我们沿关键设计轴对VLM进行严谨探究，包括预训练视觉表征、基础语言模型与指令调优语言模型的使用权衡等。我们将分析工作与三项资源贡献相结合：(1) 统一的VLM评估框架，(2) 经优化的灵活VLM训练代码，(3) 所有模型的检查点，其中包括7-13B参数规模的一族VLM模型，其性能严格优于开放源码VLM领域的先进模型InstructBLIP和LLaVa v1.5。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日