ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT → RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
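To make the idea of injecting ground-truth labels into RLVR rollouts concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the RL algorithm, so the group-normalized advantage used here (in the style of GRPO-like methods) is assumed, and the function names `group_advantages` and `inject_ground_truth`, the reward of 1.0 for the ground truth, and the example strings are all hypothetical. The three reward control strategies mentioned in the abstract are not modeled.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps).

    Assumption: a group-normalized advantage; the abstract does not say
    which advantage estimator ViSurf actually uses.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_ground_truth(rollouts, rewards, gt_response, gt_reward=1.0):
    """Return a new group where the ground-truth response joins the sampled rollouts.

    Assumption: the ground truth is scored with a fixed reward (hypothetical
    default of 1.0) and treated like any other rollout afterwards.
    """
    return rollouts + [gt_response], rewards + [gt_reward]


# Hypothetical case: every sampled rollout fails the verifier.
rollouts = ["answer A", "answer B", "answer C"]
rewards = [0.0, 0.0, 0.0]

# Without injection, all advantages are zero, so the group gives no signal.
baseline = group_advantages(rewards)

# With injection, the ground truth receives a positive advantage
# and the failed rollouts receive negative ones.
group, group_rewards = inject_ground_truth(rollouts, rewards, "reference answer")
advantages = group_advantages(group_rewards)
```

In this toy setting, a group in which every sampled rollout fails yields no learning signal on its own, while the injected ground truth supplies external supervision inside the same update, which is one way to read the abstract's "simultaneous external supervision and internal reinforcement".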

