Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning, and this adoption has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; these evaluations provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and training from base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible training code, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open VLMs.