Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural-language queries about visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning is consistent and grounded, we also measure the reasoning consistency of these models by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly to build. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while still ensuring the generation of a high-quality dataset. Based on this pipeline and an existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and the consistency of VLMs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial effort is required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and the consistency of VLMs. The first stage applies supervised fine-tuning to VLMs using step-by-step reasoning samples automatically generated by LLMs. The second stage further augments training by incorporating feedback provided by LLMs, so that the model produces reasoning chains that are highly consistent and grounded. We empirically demonstrate the effectiveness of our framework in improving both reasoning performance and consistency.
