Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a prominent recent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To this end, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, in which reasoning traces are randomly replaced with empty thoughts; this introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO, without sacrificing performance and in some cases even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
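As a rough illustration of the thought-dropout idea, the sketch below randomly replaces the reasoning span of an SFT target with an empty thought while leaving the answer untouched. The `<think>`/`<answer>` tag format, the empty-thought placeholder, and the dropout probability are illustrative assumptions, not the paper's exact implementation.

```python
import random

# Illustrative constants (assumptions): tag names and the empty-thought placeholder.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
EMPTY_THOUGHT = f"{THINK_OPEN}\n\n{THINK_CLOSE}"

def thought_dropout(response: str, p: float = 0.5) -> str:
    """With probability p, replace the reasoning trace with an empty thought."""
    start = response.find(THINK_OPEN)
    end = response.find(THINK_CLOSE)
    if start == -1 or end == -1:
        return response  # no reasoning trace to drop
    if random.random() < p:
        # Keep everything outside the think span (e.g. the final answer) unchanged.
        return response[:start] + EMPTY_THOUGHT + response[end + len(THINK_CLOSE):]
    return response

# Example: building SFT targets that mix full and empty thoughts as a cold start.
sample = "<think>Count the red cubes: 2 + 1 = 3.</think><answer>3</answer>"
print(thought_dropout(sample, p=0.5))
```

Applied over the SFT corpus, a step like this yields a mixture of "think" and "no-think" targets, giving the subsequent GRPO stage both formats to explore.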