Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) often generate reasoning that fails to integrate visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different aspects of reasoning, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates the bypassing of visual information. Experiments with Qwen-2.5-VL-7B achieve a 5.56% improvement over the base model, with consistent gains across both in-domain and out-of-domain settings.
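To make the setup concrete, the sketch below illustrates how a composite reward and GRPO-style group-relative advantages might be wired together. This is a minimal illustration, not the paper's implementation: the reward terms (format, reasoning length, answer accuracy), their weights, and the tag conventions (`<think>`, `<answer>`) are assumptions introduced here for clarity.

```python
# Hypothetical sketch: a composite reward plus GRPO group-relative advantages.
# Reward terms, weights, and tag formats are illustrative assumptions,
# not the authors' exact reward design.
from typing import List
import re

def composite_reward(response: str, gold_answer: str) -> float:
    """Score one sampled response with simple, hand-written reward terms."""
    # Format reward: response should contain an explicit reasoning block.
    format_r = 1.0 if re.search(r"<think>.*</think>", response, re.S) else 0.0
    # Length reward: encourage longer reasoning, capped to avoid runaway text.
    think = re.search(r"<think>(.*)</think>", response, re.S)
    length_r = min(len(think.group(1).split()), 200) / 200.0 if think else 0.0
    # Accuracy reward: exact match on the extracted final answer.
    answer = re.search(r"<answer>(.*)</answer>", response, re.S)
    acc_r = 1.0 if answer and answer.group(1).strip() == gold_answer else 0.0
    # Assumed weighting; the actual balance of reward terms would be tuned.
    return 0.2 * format_r + 0.2 * length_r + 0.6 * acc_r

def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Example: several responses sampled for the same prompt form one GRPO group.
group = [
    "<think>step 1 ... step 5</think><answer>42</answer>",
    "<answer>17</answer>",
    "<think>short</think><answer>42</answer>",
]
advantages = grpo_advantages([composite_reward(r, "42") for r in group])
```

Under this scheme, responses that both reason at length and answer correctly receive positive advantages relative to their group, which is the mechanism by which longer, visually grounded reasoning would be incentivized.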