Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a prominent recent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To this end, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, in which reasoning traces are randomly replaced with empty thoughts; this introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO, without sacrificing performance and in some cases even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
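As a rough illustration of the thought-dropout idea, the sketch below randomly replaces the reasoning span of an SFT target with an empty thought while leaving the answer untouched. The `<think>`/`<answer>` tag format, the empty-thought placeholder, and the dropout probability are illustrative assumptions, not the paper's exact implementation.

```python
import random

# Illustrative constants (assumptions): tag names and the empty-thought placeholder.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
EMPTY_THOUGHT = f"{THINK_OPEN}\n\n{THINK_CLOSE}"

def thought_dropout(response: str, p: float = 0.5) -> str:
    """With probability p, replace the reasoning trace with an empty thought."""
    start = response.find(THINK_OPEN)
    end = response.find(THINK_CLOSE)
    if start == -1 or end == -1:
        return response  # no reasoning trace to drop
    if random.random() < p:
        # Keep everything outside the think span (e.g. the final answer) unchanged.
        return response[:start] + EMPTY_THOUGHT + response[end + len(THINK_CLOSE):]
    return response

# Example: building SFT targets that mix full and empty thoughts as a cold start.
sample = "<think>Count the red cubes: 2 + 1 = 3.</think><answer>3</answer>"
print(thought_dropout(sample, p=0.5))
```

Applied over the SFT corpus, a step like this yields a mixture of "think" and "no-think" targets, giving the subsequent GRPO stage both formats to explore.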