This paper investigates rule-based reinforcement learning (RL) fine-tuning for visual classification with multi-modal large language models (MLLMs), focusing on the role of the thinking process. We begin by exploring \textit{CLS-RL}, a method that leverages verifiable signals as rewards to encourage MLLMs to `think' before classifying. Our experiments across \textbf{eleven} datasets demonstrate that CLS-RL achieves significant improvements over supervised fine-tuning (SFT) in both base-to-new generalization and few-shot learning scenarios. Notably, we observe a `free-lunch' phenomenon in which fine-tuning on one dataset unexpectedly improves performance on others, suggesting that RL teaches fundamental, transferable classification skills. However, we question whether explicit thinking, a critical component of rule-based RL, is always beneficial or indispensable. Challenging the conventional assumption that complex reasoning enhances performance, we introduce \textit{No-Thinking-RL}, a novel approach that minimizes the model's thinking during fine-tuning by employing an equality accuracy reward. Our experiments reveal that No-Thinking-RL achieves superior in-domain performance and generalization compared to CLS-RL, while requiring significantly less fine-tuning time. This suggests that, contrary to prevailing assumptions, reducing the thinking process can yield more efficient and effective MLLM fine-tuning for some visual tasks. Furthermore, No-Thinking-RL improves performance on other visual benchmarks, for example by 6.4\% on CVBench. We hope our findings provide insights into the impact of thinking in RL-based fine-tuning.
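As a concrete illustration of the two reward styles summarized above, the following minimal Python sketch contrasts a CLS-RL-style rule-based reward (a verifiable accuracy signal plus a think-then-answer format check) with the equality accuracy reward of No-Thinking-RL. The tag format, the reward weights, and the function names are our assumptions for illustration, not the paper's exact implementation.

\begin{verbatim}
import re

def cls_rl_reward(completion: str, label: str) -> float:
    # Rule-based reward in the CLS-RL style: the model is expected to
    # reason inside <think>...</think> and answer inside <answer>...</answer>.
    # (The tag format and the 0.5 format weight are illustrative assumptions.)
    format_ok = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
        completion, flags=re.DOTALL) is not None
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    correct = (match is not None
               and match.group(1).strip().lower() == label.strip().lower())
    return (0.5 if format_ok else 0.0) + (1.0 if correct else 0.0)

def no_thinking_rl_reward(completion: str, label: str) -> float:
    # Equality accuracy reward: full reward only when the entire output
    # equals the class label, leaving the model no room to "think".
    return 1.0 if completion.strip().lower() == label.strip().lower() else 0.0
\end{verbatim}

In a rule-based RL fine-tuning loop, such functions would score sampled completions against ground-truth labels to produce the verifiable reward signal.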