Vision-language models (VLMs) have shown remarkable advances in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues such as hallucinated image understanding or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm that boosts the reasoning capability of VLMs. The framework decouples the reasoning and critique processes by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critiques to refine those paths. In this approach, the Reasoner generates reasoning responses from text prompts, which can evolve iteratively as a policy based on feedback from the Critic. The interaction is theoretically grounded in a reinforcement learning framework in which the Critic offers natural-language critiques instead of scalar rewards, enabling more nuanced feedback that strengthens the Reasoner on complex reasoning tasks. The Critic model is trained with Direct Preference Optimization (DPO) on a preference dataset of critiques ranked by Rule-based Reward~(RBR), enhancing its critique capability. Evaluation results show that Critic-V significantly outperforms existing methods, including GPT-4V, on 5 of 8 benchmarks, particularly in reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner with constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach offers a promising solution for enhancing the reliability of VLMs, improving their performance in reasoning-heavy real-world multimodal applications such as autonomous driving and embodied intelligence.
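The Reasoner-Critic interaction described above can be sketched as a simple feedback loop: the Reasoner produces a response from the current prompt, the Critic returns a natural-language critique rather than a scalar reward, and the critique is folded back into the prompt so the Reasoner's policy evolves across rounds. The sketch below is a minimal illustration under assumed, hypothetical stand-ins (`reasoner`, `critic`, `critic_v_loop`); the actual Critic-V components are trained VLMs, not these toy functions.

```python
def reasoner(prompt: str) -> str:
    # Hypothetical stand-in for a VLM that generates a reasoning path
    # from visual and textual inputs (here, text only).
    return f"reasoning path for: {prompt}"

def critic(response: str) -> str:
    # Hypothetical stand-in for the DPO-trained Critic: it returns a
    # natural-language critique instead of a scalar reward. An empty
    # string signals the Critic has no further objections.
    if "Critique:" in response:
        return ""
    return "verify the visual grounding of the intermediate step"

def critic_v_loop(question: str, max_rounds: int = 3) -> str:
    """Iteratively evolve the text prompt (the Reasoner's policy)
    using the Critic's natural-language feedback."""
    prompt = question
    response = reasoner(prompt)
    for _ in range(max_rounds):
        feedback = critic(response)
        if not feedback:  # Critic is satisfied; stop refining
            break
        # Fold the critique into the prompt, so the next reasoning
        # attempt is conditioned on the feedback.
        prompt = f"{prompt}\nCritique: {feedback}"
        response = reasoner(prompt)
    return response

print(critic_v_loop("What object is left of the red car?"))
```

In this toy loop the critique is literally appended to the prompt; in practice the Critic's feedback would guide the Reasoner's next generation in a more structured way, but the control flow (generate, critique, refine, repeat until the Critic is satisfied) is the same.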