Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is regarded as a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated at the current search step, but also anticipates the quality of the subsequent sentences that may result from it, thus providing a long-term value signal. In this way, VisVM steers VLMs away from sentences prone to hallucination or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods using other visual reward signals. Furthermore, we find that self-training the model on VisVM-guided captions improves VLM performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
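The sentence-level, value-guided search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the interfaces `vlm.sample_sentences` (proposes candidate next sentences) and `value_model.score` (estimates the long-term value of a partial response given the image) are hypothetical names chosen for clarity.

```python
def value_guided_search(image, prompt, vlm, value_model,
                        num_candidates=4, max_sentences=8):
    """Greedy sentence-level search guided by a value model.

    At each step, several candidate next sentences are sampled from the
    VLM, and the candidate with the highest predicted long-term value
    (expected quality of the eventual full response, not just of the
    immediate sentence) is kept.
    """
    response = []
    for _ in range(max_sentences):
        # Propose several candidate continuations from the VLM.
        candidates = vlm.sample_sentences(image, prompt, response,
                                          n=num_candidates)
        if not candidates:
            break
        # Select the candidate the value model scores highest; this is
        # what steers decoding away from hallucination-prone sentences.
        best = max(candidates,
                   key=lambda s: value_model.score(image, response + [s]))
        response.append(best)
        if best.endswith("<eos>"):  # assumed end-of-sequence marker
            break
    return " ".join(response)
```

A beam-style variant would keep the top-k partial responses per step instead of a single greedy choice; the scoring logic is unchanged.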