Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is regarded as a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated at the current search step, but also anticipates the quality of the subsequent sentences that may result from it, thus providing a long-term value signal. In this way, VisVM steers VLMs away from sentences prone to hallucination or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods using other visual reward signals. Furthermore, we find that self-training the model on VisVM-guided captions improves VLM performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
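The sentence-level, value-guided search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the interfaces `vlm.sample_sentences` (proposes candidate next sentences) and `value_model.score` (estimates the long-term value of a partial response given the image) are hypothetical names chosen for clarity.

```python
def value_guided_search(image, prompt, vlm, value_model,
                        num_candidates=4, max_sentences=8):
    """Greedy sentence-level search guided by a value model.

    At each step, several candidate next sentences are sampled from the
    VLM, and the candidate with the highest predicted long-term value
    (expected quality of the eventual full response, not just of the
    immediate sentence) is kept.
    """
    response = []
    for _ in range(max_sentences):
        # Propose several candidate continuations from the VLM.
        candidates = vlm.sample_sentences(image, prompt, response,
                                          n=num_candidates)
        if not candidates:
            break
        # Select the candidate the value model scores highest; this is
        # what steers decoding away from hallucination-prone sentences.
        best = max(candidates,
                   key=lambda s: value_model.score(image, response + [s]))
        response.append(best)
        if best.endswith("<eos>"):  # assumed end-of-sequence marker
            break
    return " ".join(response)
```

A beam-style variant would keep the top-k partial responses per step instead of a single greedy choice; the scoring logic is unchanged.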