Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly to construct. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces annotation cost while ensuring the generation of a high-quality dataset. Based on this pipeline and an existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and the consistency of VLMs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial effort is required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and the consistency of VLMs. The first stage applies supervised fine-tuning of VLMs on step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment training with feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework on both reasoning performance and consistency.
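To make the notion of reasoning consistency concrete, the sketch below shows one plausible way such a CoT-based measure could be operationalized; this is an illustrative assumption, not the paper's actual metric or data format. The record structure (`steps_correct`, `answer_correct`) and the function `cot_consistency` are hypothetical names introduced here for illustration only.

```python
# Illustrative sketch (NOT the paper's actual metric): one simple way to
# quantify CoT reasoning consistency is to check how often a model's
# high-level answer is correct on exactly those examples where every
# intermediate reasoning step in its chain is also correct.

def cot_consistency(records):
    """Fraction of fully-grounded chains whose final answer is correct.

    Each record is a dict with:
      'steps_correct': list[bool]  # per-step correctness of the CoT chain
      'answer_correct': bool       # correctness of the final answer
    """
    # Keep only examples where the entire reasoning chain is correct.
    grounded = [r for r in records if all(r["steps_correct"])]
    if not grounded:
        return 0.0
    # Among those, how often does the final answer agree?
    return sum(r["answer_correct"] for r in grounded) / len(grounded)


records = [
    {"steps_correct": [True, True], "answer_correct": True},
    {"steps_correct": [True, True], "answer_correct": False},
    {"steps_correct": [True, False], "answer_correct": True},
]
print(cot_consistency(records))  # 1 of 2 fully-grounded chains -> 0.5
```

A low score under a measure like this would indicate that the model reaches correct conclusions without its reasoning chain supporting them, which is the kind of inconsistency the abstract describes.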