The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block for their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a single scalar value to critique the entire reasoning [4], VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought and provide natural language explanations to support their judgments. Extensive evaluation of 24 LVLMs demonstrates that human-written critiques significantly enhance performance after correction, showcasing the potential of the self-improvement strategy. However, model-generated critiques are less helpful and sometimes detrimental to performance, suggesting that critique is the crucial bottleneck. We identify three common patterns in critique failures: failure to critique visual perception, reluctance to "say no", and exaggerated assumptions of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance, by up to 13.5%.
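The LookBack strategy described above can be sketched as a per-step verification loop. The following is a minimal illustration, not the paper's implementation: the function names (`build_lookback_prompts`, `lookback_critique`) and the `query_lvlm` interface are hypothetical, and the prompt wording is illustrative only. The key idea it captures is that the critic is redirected to the image separately for each chain-of-thought step, rather than judging the whole reasoning at once.

```python
# Hypothetical sketch of a LookBack-style critique loop.
# Each reasoning step gets its own image-grounded verification prompt,
# forcing the critic to re-examine the image before judging that step.

def build_lookback_prompts(question, cot_steps):
    """Build one image-grounded verification prompt per reasoning step."""
    prompts = []
    for i, step in enumerate(cot_steps, start=1):
        prompts.append(
            f"Question: {question}\n"
            f"Reasoning step {i}: {step}\n"
            "Look back at the image and verify every piece of visual "
            "information this step relies on. Is the step correct? "
            "Answer 'correct' or 'incorrect' and explain briefly."
        )
    return prompts

def lookback_critique(query_lvlm, image, question, cot_steps):
    """Collect one critique per step; query_lvlm is a hypothetical
    callable (image, prompt) -> str wrapping an LVLM."""
    return [query_lvlm(image, p)
            for p in build_lookback_prompts(question, cot_steps)]
```

In this sketch, the downstream correction stage would then revise only the steps judged incorrect, consistent with the fine-grained, step-level critique format that VISCO evaluates.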