Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, yet it is inherently difficult because seeing and reasoning are intertwined in existing VLMs. To tackle this issue, we present Prism, a framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that uses a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses from the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10\times$ larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism.
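The two-stage design described above can be illustrated with a minimal sketch. This is not the Prism codebase or its API: the class `TwoStagePipeline`, the type aliases, the prompt wording, and the stub models `stub_vlm` and `stub_llm` are all hypothetical names introduced here for illustration. The sketch only assumes that the perception module maps an image and an instruction to text, and that the reasoning module maps a text prompt to an answer.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not Prism's actual API):
# a perception model maps (image bytes, instruction) to a text description,
# a reasoning model maps a text prompt to an answer string.
PerceptionFn = Callable[[bytes, str], str]
ReasoningFn = Callable[[str], str]


@dataclass
class TwoStagePipeline:
    """Decoupled pipeline: a VLM describes the image, an LLM answers from text only."""
    perceive: PerceptionFn
    reason: ReasoningFn
    instruction: str = "Describe the image in detail."

    def answer(self, image: bytes, question: str) -> str:
        # Stage 1 (perception): the VLM sees the image and emits text.
        description = self.perceive(image, self.instruction)
        # Stage 2 (reasoning): the LLM never sees the image, only the description.
        prompt = (
            f"Image description: {description}\n"
            f"Question: {question}\n"
            "Answer:"
        )
        return self.reason(prompt)


# Stub models standing in for a real VLM and LLM, so the sketch runs offline.
def stub_vlm(image: bytes, instruction: str) -> str:
    return "two red apples on a wooden table"


def stub_llm(prompt: str) -> str:
    return "2" if "two red apples" in prompt else "unknown"


pipeline = TwoStagePipeline(perceive=stub_vlm, reason=stub_llm)
print(pipeline.answer(b"\x00\x01", "How many apples are there?"))  # prints "2"
```

Because the two modules are plain callables, either one can be swapped while the other is held fixed. Under the assumptions of this sketch, that is how the perception strength of different VLMs could be compared with a fixed LLM, and vice versa.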